pith. machine review for the scientific record.

arxiv: 2502.18449 · v2 · submitted 2025-02-25 · 💻 cs.SE · cs.AI · cs.CL

Recognition: 3 Lean theorem links

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:23 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords reinforcement learning · large language models · software engineering · reasoning · code generation · software evolution · SWE-bench

The pith

Reinforcement learning on open software evolution data enables LLMs to recover developer reasoning and solve 41% of real GitHub issues on SWE-bench Verified.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-RL, which applies reinforcement learning directly to the full record of software projects: code snapshots, code changes, issues, and pull requests. It uses a simple rule-based similarity score between model outputs and the actual developer solutions as the reward signal. This setup lets the model learn from massive real-world data rather than curated math or coding problems. The resulting 70B model reaches a 41.0% solve rate on the human-verified SWE-bench Verified benchmark, matching some larger proprietary systems, and shows gains on out-of-domain tasks where supervised fine-tuning hurts performance. The central argument is that software evolution supplies a scalable training signal for genuine reasoning improvements.

Core claim

SWE-RL applies reinforcement learning with a lightweight rule-based similarity reward to open-source software evolution data, allowing a Llama 3 base model to autonomously reconstruct developer reasoning processes; the trained Llama3-SWE-RL-70B model solves 41.0% of tasks on SWE-bench Verified, the highest reported result for models under 100B parameters and comparable to leading closed models, while also improving performance on five out-of-domain benchmarks: function coding, library use, code reasoning, mathematics, and general language understanding.

What carries the argument

SWE-RL, the reinforcement learning procedure that treats the similarity score between LLM-generated patches and ground-truth developer solutions as the sole reward, applied at scale to software evolution records of code changes and events.
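The page later quotes the paper as computing this reward with Python's difflib.SequenceMatcher over the predicted and oracle patch. A minimal sketch under that reading, with the function name chosen here for illustration:

```python
import difflib

def patch_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Rule-based reward: similarity of the generated patch to the
    developer's ground-truth patch, a value in [0, 1]."""
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()
```

Identical patches score 1.0 and unrelated text tends toward 0.0; no tests are executed, which is what makes the reward cheap to compute at the scale of software evolution records.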

If this is right

  • LLMs can acquire transferable reasoning skills by learning from the natural evolution of real codebases rather than synthetic problems alone.
  • Software engineering data at scale forms an effective domain for reinforcement learning that avoids the performance degradation seen in supervised fine-tuning on out-of-domain tasks.
  • A rule-based reward derived from existing project records is sufficient to drive measurable gains on practical issue-resolution benchmarks.
  • Medium-sized open models can reach performance levels previously associated only with much larger or proprietary systems when trained via RL on evolution data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same RL-on-evolution approach could transfer to other domains that maintain long records of incremental changes, such as scientific code or legal document histories.
  • Observed generalization across coding, math, and language tasks implies that software evolution data encodes broadly applicable reasoning structures.
  • Future scaling of this method would benefit from testing whether richer reward signals, such as test-suite execution outcomes, produce further gains beyond the similarity metric.
  • Open-source project histories offer a low-cost, continually growing data source that could reduce dependence on curated reasoning datasets.

Load-bearing premise

A simple similarity score between generated and ground-truth code solutions supplies a reward that teaches real reasoning instead of surface-level pattern matching.

What would settle it

Demonstration that high benchmark scores arise from memorizing training-project patterns rather than solving previously unseen issues, or that removing the RL stage and using only supervised fine-tuning on the same data yields equal or better results on SWE-bench Verified.

read the original abstract

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SWE-RL, the first RL-based approach to scale LLM reasoning on real-world software engineering by training on open-source software evolution data (code snapshots, changes, issues, PRs) using a lightweight rule-based similarity reward between ground-truth and generated patches. It reports that Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified, claimed as the best result for models under 100B parameters and comparable to GPT-4o, while also exhibiting emergent generalization to five out-of-domain tasks (function coding, library use, code reasoning, mathematics, general language understanding) where a supervised fine-tuning baseline degrades performance.

Significance. If the results are robust, this would be a significant contribution to the field by demonstrating that RL on massive software evolution corpora can recover developer-like reasoning and produce generalized capabilities beyond the training distribution, extending RL successes from math/coding to practical SE and opening a scalable data source for future work.

major comments (3)
  1. [Abstract] Abstract: The central performance claim (41.0% on SWE-bench Verified) and the generalization result both rest on the unverified assumption that a lightweight rule-based similarity score between generated and ground-truth patches constitutes an effective RL reward for genuine reasoning; without the exact definition of this score, its scaling, or any execution/test-passing component, it is impossible to rule out that gains arise from lexical/template matching rather than causal understanding of the issues.
  2. [Method and Experiments] Method and Experiments sections: No ablation studies, reward variants, or comparisons to execution-based rewards are reported, nor are details provided on data curation, training dynamics, or statistical significance of the 41.0% result (e.g., variance across seeds or tests against baselines); these omissions make the out-of-domain generalization claim load-bearing yet unsupported.
  3. [Abstract and Results] Abstract and Results: The claim that SWE-RL yields 'generalized reasoning skills' while SFT degrades performance requires explicit controls showing that the RL objective (rather than data volume or model scale) drives the difference; the current presentation leaves open the possibility that the observed pattern is an artifact of the particular training setup.
minor comments (2)
  1. [Abstract] Abstract: The statement 'to our knowledge, this is the best performance reported for medium-sized (<100B) LLMs' should cite the specific competing models and their reported scores for direct comparison.
  2. [Method] Throughout: Clarify whether the similarity reward operates on raw patches, normalized diffs, or AST-level representations, as this affects reproducibility.
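The second minor comment can be made concrete: under a plain difflib-style similarity, patches that differ only in surface form receive different rewards, so whether the score sees raw patches, normalized diffs, or AST-level representations changes the training signal. A small illustration, not the paper's pipeline:

```python
import difflib

def sim(a: str, b: str) -> float:
    """Text similarity in [0, 1] via difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

raw = "if x == None:\n    return 0\n"
equivalent = "if x is None:\n    return 0\n"   # an equivalent fix in idiomatic form
reformatted = "if x==None:\n\treturn 0\n"      # same code, different whitespace

# Neither variant scores a perfect 1.0 against the raw patch, even though
# all three behave the same for typical inputs; the representation the
# reward operates on therefore matters for reproducibility.
print(sim(raw, equivalent), sim(raw, reformatted))
```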

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript accordingly to strengthen clarity and support for the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claim (41.0% on SWE-bench Verified) and the generalization result both rest on the unverified assumption that a lightweight rule-based similarity score between generated and ground-truth patches constitutes an effective RL reward for genuine reasoning; without the exact definition of this score, its scaling, or any execution/test-passing component, it is impossible to rule out that gains arise from lexical/template matching rather than causal understanding of the issues.

    Authors: We appreciate the referee pointing this out. The current manuscript describes the reward only at a high level as a lightweight rule-based similarity score (e.g., between ground-truth and generated patches) without providing the exact formulation, scaling details, or explicit comparison to execution-based alternatives. We will revise the abstract and add a precise definition plus computation details in the Methods section, along with a discussion of its design choices and limitations (including the lack of test-passing verification). This will allow readers to better assess whether the gains reflect reasoning or other factors, while retaining the scalable nature of the approach. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: No ablation studies, reward variants, or comparisons to execution-based rewards are reported, nor are details provided on data curation, training dynamics, or statistical significance of the 41.0% result (e.g., variance across seeds or tests against baselines); these omissions make the out-of-domain generalization claim load-bearing yet unsupported.

    Authors: We agree that the current version lacks these elements, which weakens the support for the generalization results. In the revised manuscript we will add ablation studies on reward components and variants, include comparisons to execution-based rewards where feasible, expand the data curation and training dynamics descriptions, and report statistical significance including variance across seeds for the main SWE-bench result and baseline comparisons. revision: yes

  3. Referee: [Abstract and Results] Abstract and Results: The claim that SWE-RL yields 'generalized reasoning skills' while SFT degrades performance requires explicit controls showing that the RL objective (rather than data volume or model scale) drives the difference; the current presentation leaves open the possibility that the observed pattern is an artifact of the particular training setup.

    Authors: The SFT baseline was constructed using the identical data volume, model scale, and initialization as the RL run to isolate the objective. We will revise the abstract and Results section to state this control explicitly, add supporting analysis of training dynamics, and discuss why the observed pattern (RL improvement vs. SFT degradation) is attributable to the RL objective rather than other setup factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper trains an RL model using a rule-based similarity reward computed directly against ground-truth patches drawn from open-source evolution data, then evaluates the resulting model on the independent, human-verified SWE-bench Verified benchmark and on separate out-of-domain tasks. No equations, fitted parameters, or self-citations are presented that would reduce the claimed performance gains or generalization to the training reward by construction. The chain is therefore non-circular: standard RL is applied to external data, with results measured on held-out benchmarks rather than through self-referential definitions or renamed fits.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that software evolution data encodes transferable reasoning patterns and that a simple similarity-based reward is sufficient to elicit them via RL; no invented entities are introduced.

free parameters (2)
  • RL training hyperparameters
    Standard parameters such as learning rate and batch size are required for the RL process but not detailed in the abstract.
  • Reward scaling coefficient
    Scaling applied to the similarity score reward to stabilize training.
axioms (2)
  • domain assumption Similarity score between generated and ground-truth solutions is a valid proxy for reasoning quality
    Directly used as the lightweight rule-based reward signal.
  • domain assumption Open-source software evolution data contains generalizable reasoning signals
    Basis for training on lifecycle records including code changes and events.
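The ledger's reward scaling coefficient is not pinned down anywhere on this page; one common pattern in policy-gradient training is to mean-center rewards within a batch and apply a scale factor before the update. The helper below is a hypothetical sketch of that pattern, not the paper's implementation:

```python
def scaled_advantages(rewards, scale=1.0):
    """Mean-center a batch of similarity rewards, then apply a scale factor.

    `scale` stands in for the ledger's reward scaling coefficient; the
    centering scheme is an assumption, not a detail from the paper.
    """
    mean = sum(rewards) / len(rewards)
    return [scale * (r - mean) for r in rewards]
```

Centering keeps above-average patches pushed up and below-average ones pushed down regardless of the raw similarity level, which is one reason a scaling knob can affect training stability.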

pith-pipeline@v0.9.0 · 5642 in / 1557 out tokens · 56161 ms · 2026-05-15T10:23:06.408182+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    "Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer’s reasoning processes..."

  • Foundation.LawOfExistence defect_zero_iff_one · contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    "the reward is a similarity score (between 0 and 1) of the predicted and the oracle patch calculated by Python’s difflib.SequenceMatcher"

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    UNCLEAR: relation between the paper passage and the cited Recognition theorem.

    "Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  2. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  3. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 7.0

    BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.

  4. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  5. Agentic Discovery of Exchange-Correlation Density Functionals

    cs.AI 2026-05 conditional novelty 7.0

    An agentic LLM system discovers the XC functional SAFS26-a that improves on the ωB97M-V baseline by roughly 9% on a held-out thermochemistry dataset while warning that such systems can exploit unphysical shortcuts.

  6. Faithful Mobile GUI Agents with Guided Advantage Estimator

    cs.AI 2026-05 unverdicted novelty 7.0

    Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.

  7. ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

    cs.SE 2026-04 unverdicted novelty 7.0

    ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...

  8. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  9. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  10. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  11. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.

  12. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.

  13. Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 6.0

    REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.

  14. CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.

  15. AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

  16. TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

    cs.DC 2026-04 unverdicted novelty 6.0

    TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

  17. ARuleCon: Agentic Security Rule Conversion

    cs.CR 2026-04 unverdicted novelty 6.0

    ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

  18. Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

    cs.LG 2026-04 unverdicted novelty 6.0

    Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

  19. Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis

    cs.SE 2026-04 unverdicted novelty 5.0

    SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.

  20. Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures

    cs.SE 2026-04 unverdicted novelty 4.0

    Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.

  21. Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 4.0

    Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.

Reference graph

Works this paper leans on

192 extracted references · 192 canonical work pages · cited by 19 Pith papers · 14 internal anchors

  1. [1]

    Claude 3.5 sonnet model card addendum

    Anthropic. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 2024 a

  2. [2]

    Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet

    Anthropic. Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet . https://www.anthropic.com/research/swe-bench-sonnet, 2024 b

  3. [4]

    Codet: Code generation with generated tests

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=ktrw68Cmu9c

  4. [5]

    Evaluating large language models trained on code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  5. [6]

    Meta large language model compiler: Foundation models of compiler optimization, 2024

    Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather. Meta large language model compiler: Foundation models of compiler optimization, 2024. https://arxiv.org/abs/2407.02524

  6. [7]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report, 2024. https://arxiv.org/abs/2412.19437

  7. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

  8. [9]

    DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao...

  9. [10]

    Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023

  10. [11]

    Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion

    Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, ...

  11. [13]

    The Llama 3 Herd of Models

    AI@Meta: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, and Angela Fan et al. The llama 3 herd of models, 2024. https://arxiv.org/abs/2407.21783

  12. [14]

    Aider is ai pair programming in your terminal

    Paul Gauthier. Aider is ai pair programming in your terminal. https://aider.chat/, 2024

  13. [15]

    Rlef: Grounding code llms in execution feedback with reinforcement learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025. https://arxiv.org/abs/2410.02089

  14. [16]

    Github rest api documentation

    GitHub. Github rest api documentation. https://docs.github.com/en/rest?apiVersion=2022-11-28, 2022. Accessed: 2025-02-24

  15. [17]

    GitHub. Github. https://github.com, 2025

  16. [18]

    Gh archive

    Ilya Grigorik. Gh archive. https://www.gharchive.org/, 2025. Accessed: 2025-02-23

  17. [19]

    CRUXEval: A benchmark for code reasoning, understanding and execution

    Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. In Proceedings of the 41st ICML, volume 235 of Proceedings of Machine Learning Research, pages 16568--16621. PMLR, 21--27 Jul 2024. https://proceedings.mlr.press/v235/gu24c.html

  18. [20]

    Deepseek-coder: When the large language model meets programming -- the rise of code intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence, 2024

  19. [21]

    Measuring coding challenge competence with apps, 2021 a

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps, 2021 a

  20. [22]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021 b . https://openreview.net/forum?id=d7KBjmI3GmQ

  21. [23]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021 c . https://openreview.net/forum?id=7Bywt2mQsCe

  22. [24]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report, 2024. https://arxiv.org/abs/2...

  23. [25]

    Testgeneval: A real world unit test generation and test completion benchmark, 2024 a

    Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark, 2024 a . https://arxiv.org/abs/2410.00752

  24. [26]

    Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024 b

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024 b

  25. [27]

    Impact of code language models on automated program repair, 2023

    Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair, 2023

  26. [28]

    Swe-bench: Can language models resolve real-world github issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2023

  27. [29]

    Swe-bench leaderboard

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench leaderboard. https://www.swebench.com, 2024. Accessed: 2025-02-04

  28. [30]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. https://arxiv.org/abs/1412.6980

  29. [31]

    Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models

    Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 919--931. IEEE, 2023

  30. [32]

    Starcoder: may the source be with you!, 2023

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge...

  31. [34]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=1qvx610Cu7

  32. [35]

    Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. Evaluating language models for efficient code generation. In First Conference on Language Modeling, 2024a. https://openreview.net/forum?id=IBCBMeAhmC

  33. [36]

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, 2024b. https://openreview.net/forum?id=pPjZIOuQuF

  34. [37]

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Z...

  35. [38]

    Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, and Yongbin Li. Lingma swe-gpt: An open development-process-centric language model for automated software improvement, 2024. https://arxiv.org/abs/2411.00622

  36. [39]

    OpenAI. GPT-4o system card, 2024a. https://arxiv.org/abs/2410.21276

  37. [40]

    OpenAI. OpenAI o1 system card, 2024b. https://arxiv.org/abs/2412.16720

  38. [41]

    OpenAI. simple-evals. https://github.com/openai/simple-evals, 2024. Accessed: 2025-02-23

  39. [42]

    OpenAI. Introducing swe-bench verified. https://openai.com/index/introducing-swe-bench-verified, 2024. Accessed: 2025-02-04

  40. [43]

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2024. https://arxiv.org/abs/2412.21139

  41. [44]

    John W. Ratcliff and David E. Metzener. Pattern Matching: The Gestalt Approach . Dr. Dobb's Journal, page 46, July 1988. https://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
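    The gestalt pattern-matching approach described in this Dr. Dobb's article is what Python's `difflib.SequenceMatcher` implements, and it is the kind of lightweight sequence-similarity measure the paper's rule-based reward is built on. A minimal sketch, using invented patch strings (the variable names and example diff are illustrative, not taken from the paper):

```python
import difflib

# SequenceMatcher implements the gestalt algorithm from the cited
# article: find the longest matching block, recurse on the unmatched
# pieces on either side, and score similarity as 2*M/T, where M is the
# number of matched characters and T is the combined length.
old_patch = "def add(a, b):\n    return a + b\n"
new_patch = "def add(a, b):\n    return a - b\n"

ratio = difflib.SequenceMatcher(None, old_patch, new_patch).ratio()
print(f"{ratio:.5f}")  # 0.96875: the two 32-char patches differ by one character
```

    A continuous score like this rewards near-miss patches instead of the all-or-nothing signal of test execution, which is what makes it cheap enough to apply at scale.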

  42. [45]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thoma...

  43. [46]

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 2023

  44. [47]

    John Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html, 2020. Accessed: 2025-02-22

  45. [48]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300

  46. [49]

    Sida I. Wang, Alex Gu, Lovish Madaan, Dieuwke Hupkes, Jiawei Liu, Yuxiang Wei, Naman Jain, Yuhang Lai, Sten Sootla, Ofir Press, Baptiste Rozière, and Gabriel Synnaeve. EvalArena: noise and errors on LLM evaluations. https://github.com/crux-eval/eval-arena, 2024a

  47. [50]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

  48. [51]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=_V...

  49. [52]

    Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the copilots: Fusing large language models with completion engines for automated program repair, 2023

  50. [53]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-Instruct. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 52632--52657. PMLR, 21--27 Jul 2024. https://proceedings.mlr.press/v235/wei24h.html

  51. [54]

    Chunqiu Steven Xia and Lingming Zhang. Less training, more repairing please: Revisiting automated program repair via zero-shot learning, 2022

  52. [55]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Universal fuzzing via large language models, 2023a

  53. [56]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482--1494, 2023b. doi:10.1109/ICSE48619.2023.00129

  54. [57]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. https://arxiv.org/abs/2407.01489

  55. [58]

    Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, and Kai Chen. Swe-fixer: Training open-source llms for effective and efficient github issue resolution, 2025

  56. [59]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  57. [60]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. https://openreview.net/forum?id=mXpq6ut8J3

  58. [61]

    John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do AI systems generalize to visual software domains?, 2024c. https://arxiv.org/abs/2410.03859

  59. [62]

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. https://arxiv.org/abs/2502.03373

  60. [64]

    Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder RL via automated test-case synthesis. arXiv preprint arXiv:2502.01718, 2025

  61. [65]

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation, 2023

  62. [66]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement, 2024. https://arxiv.org/abs/2404.05427

  63. [67]

    Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, and Alexander M Rush. Commit0: Library generation from scratch, 2024. https://arxiv.org/abs/2412.01769

  64. [69]

    Albert Örwall. Moatless tools. https://github.com/aorwall/moatless-tools, 2024

  65. [70]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, et al. Code Llama: Open Foundation Models for Code, 2023. https://arxiv.org/abs/2308.12950

  66. [71]

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct, 2023. https://arxiv.org/abs/2306.08568

  67. [72]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching Large Language Models to Self-Debug, 2023. https://arxiv.org/abs/2304.05128

  68. [73]

    Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program Synthesis. Foundations and Trends® in Programming Languages, 2017. doi:10.1561/2500000010

  69. [74]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  70. [75]

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In The Eleventh International Conference on Learning Representations, 2023

  71. [76]

    Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. doi:10.18653/v1/2021.emnlp-main.685

  72. [77]

    Amazon Web Services

  73. [78]

    Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair, 2023

  74. [79]

    Sahil Chaudhary. Code Alpaca: An Instruction-following LLaMA Model for Code Generation. GitHub repository, 2023

  75. [80]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model. GitHub repository, 2023

  76. [81]

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned Language Models Are Zero-Shot Learners, 2022. https://arxiv.org/abs/2109.01652

  77. [82]

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv preprint arXiv:2304.12244, 2023

  78. [83]

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568, 2023

  79. [84]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. PaLM: Scaling Language Modeling with Pathways, 2022. https://arxiv.org/abs/2204.02311

  80. [85]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, et al. Training Compute-Optimal Large Language Models, 2022. https://arxiv.org/abs/2203.15556

Showing first 80 references.