SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Pith reviewed 2026-05-15 10:23 UTC · model grok-4.3
The pith
Reinforcement learning on open software evolution data enables LLMs to recover developer reasoning and solve 41.0% of the real GitHub issues in SWE-bench Verified.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SWE-RL applies reinforcement learning with a lightweight rule-based similarity reward to open-source software evolution data, allowing a Llama 3 model to autonomously reconstruct developer reasoning processes. The trained Llama3-SWE-RL-70B model solves 41.0% of tasks on SWE-bench Verified, which the authors report as the best result to date for models under 100B parameters and comparable to leading proprietary models such as GPT-4o. It also improves performance on five out-of-domain tasks: function coding, library use, code reasoning, mathematics, and general language understanding.
What carries the argument
SWE-RL, the reinforcement learning procedure that treats the similarity score between LLM-generated patches and ground-truth developer solutions as the sole reward, applied at scale to software evolution records of code changes and events.
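The reward described above can be sketched in a few lines. The passage this page quotes from the paper says the score lies between 0 and 1 and is computed with Python's difflib.SequenceMatcher. The function name `similarity_reward` and the sample patches below are hypothetical, and the sketch leaves out anything else the training pipeline may do, such as handling malformed outputs.

```python
from difflib import SequenceMatcher


def similarity_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Return a similarity score in [0, 1] between two patch strings.

    Illustrative only: this compares raw text, and the paper's exact
    preprocessing is not described on this page.
    """
    return SequenceMatcher(None, predicted_patch, oracle_patch).ratio()


# Hypothetical patches, invented for this example.
oracle = "-    return a - b\n+    return a + b\n"
exact = "-    return a - b\n+    return a + b\n"
unrelated = "+import os\n"

print(similarity_reward(exact, oracle))      # identical patches score 1.0
print(similarity_reward(unrelated, oracle))  # dissimilar patches score lower
```

Because the score needs only the two strings, it can be computed for any historical change without building or executing the project, which is what makes the reward lightweight.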
If this is right
- LLMs can acquire transferable reasoning skills by learning from the natural evolution of real codebases rather than synthetic problems alone.
- Software engineering data at scale forms an effective domain for reinforcement learning, one that avoids the out-of-domain performance degradation seen with supervised fine-tuning.
- A rule-based reward derived from existing project records is sufficient to drive measurable gains on practical issue-resolution benchmarks.
- Medium-sized open models can reach performance levels previously associated only with much larger or proprietary systems when trained via RL on evolution data.
Where Pith is reading between the lines
- The same RL-on-evolution approach could transfer to other domains that maintain long records of incremental changes, such as scientific code or legal document histories.
- Observed generalization across coding, math, and language tasks implies that software evolution data encodes broadly applicable reasoning structures.
- Future scaling of this method would benefit from testing whether richer reward signals, such as test-suite execution outcomes, produce further gains beyond the similarity metric.
- Open-source project histories offer a low-cost, continually growing data source that could reduce dependence on curated reasoning datasets.
Load-bearing premise
A simple similarity score between generated and ground-truth code solutions supplies a reward that teaches real reasoning instead of surface-level pattern matching.
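One way to make this premise concrete is to score a patch that is textually close to the reference but behaves differently. The sketch below assumes the reward is the plain `difflib.SequenceMatcher` ratio over raw text, as in the passage this page quotes from the paper. The two snippets and the helper name are invented for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical reference fix: accept zero as a valid value.
oracle = "if x >= 0:\n    return x\n"
# Hypothetical model output: one character shorter, but it rejects zero.
candidate = "if x > 0:\n    return x\n"

score = text_similarity(candidate, oracle)
print(f"similarity = {score:.3f}")  # high, although the behavior differs at x == 0
```

A test-suite check would separate these two patches. A text-similarity score alone does not, which is the gap the premise has to bridge.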
What would settle it
Demonstration that high benchmark scores arise from memorizing training-project patterns rather than solving previously unseen issues, or that removing the RL stage and using only supervised fine-tuning on the same data yields equal or better results on SWE-bench Verified.
Original abstract
The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWE-RL, presented as the first RL-based approach to scale LLM reasoning on real-world software engineering. It trains on open-source software evolution data (code snapshots, changes, issues, PRs) using a lightweight rule-based similarity reward between ground-truth and generated patches. It reports that Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified, claimed as the best result for models under 100B parameters and comparable to GPT-4o, while also improving on five out-of-domain tasks (function coding, library use, code reasoning, mathematics, general language understanding), whereas a supervised fine-tuning baseline degrades performance on average.
Significance. If the results are robust, the work would be a significant contribution: it would demonstrate that RL on massive software evolution corpora can recover developer-like reasoning and produce capabilities that generalize beyond the training distribution, extending RL successes from math and competitive coding to practical SE and opening a scalable data source for future work.
major comments (3)
- [Abstract] Abstract: The central performance claim (41.0% on SWE-bench Verified) and the generalization result both rest on the unverified assumption that a lightweight rule-based similarity score between generated and ground-truth patches constitutes an effective RL reward for genuine reasoning; without the exact definition of this score, its scaling, or any execution/test-passing component, it is impossible to rule out that gains arise from lexical/template matching rather than causal understanding of the issues.
- [Method and Experiments] Method and Experiments sections: No ablation studies, reward variants, or comparisons to execution-based rewards are reported, nor are details provided on data curation, training dynamics, or statistical significance of the 41.0% result (e.g., variance across seeds or tests against baselines); these omissions make the out-of-domain generalization claim load-bearing yet unsupported.
- [Abstract and Results] Abstract and Results: The claim that SWE-RL yields 'generalized reasoning skills' while SFT degrades performance requires explicit controls showing that the RL objective (rather than data volume or model scale) drives the difference; the current presentation leaves open the possibility that the observed pattern is an artifact of the particular training setup.
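To give a sense of scale for the statistical-significance concern in the second major comment, the sketch below computes a 95% Wilson score interval for a single 41.0% solve rate. It assumes the benchmark has 500 tasks (the size of SWE-bench Verified as released, which this page does not state) and treats tasks as independent trials. It captures sampling uncertainty over tasks only, not variance across training seeds.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Assumption: 500 tasks, so a 41.0% solve rate corresponds to 205 solved.
low, high = wilson_interval(successes=205, trials=500)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans several percentage points, so a difference of a point or two between systems would need repeated runs to interpret.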
minor comments (2)
- [Abstract] Abstract: The statement 'to our knowledge, this is the best performance reported for medium-sized (<100B) LLMs' should cite the specific competing models and their reported scores for direct comparison.
- [Method] Throughout: Clarify whether the similarity reward operates on raw patches, normalized diffs, or AST-level representations, as this affects reproducibility.
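The reproducibility point in the last minor comment can be illustrated by scoring the same pair of patches under two representations. This is a sketch, not the authors' procedure: the `normalize` helper (strip trailing whitespace, drop blank lines) is a hypothetical choice, and only the use of `difflib.SequenceMatcher` comes from the passage this page quotes from the paper.

```python
from difflib import SequenceMatcher


def normalize(patch: str) -> str:
    """Hypothetical normalization: strip trailing whitespace and drop blank lines."""
    lines = (line.rstrip() for line in patch.splitlines())
    return "\n".join(line for line in lines if line)


def score(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical patches: the same change, but the prediction carries extra whitespace.
oracle = "+def add(a, b):\n+    return a + b\n"
predicted = "+def add(a, b):   \n\n+    return a + b\n"

raw_score = score(predicted, oracle)
normalized_score = score(normalize(predicted), normalize(oracle))
print(raw_score, normalized_score)  # the normalized pair is identical; the raw pair is not
```

The same model output earns a different reward depending on the representation, which is why the choice needs to be stated for the results to be reproducible.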
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, with plans to revise the manuscript accordingly to improve clarity and support for the claims.
-
Referee: [Abstract] Abstract: The central performance claim (41.0% on SWE-bench Verified) and the generalization result both rest on the unverified assumption that a lightweight rule-based similarity score between generated and ground-truth patches constitutes an effective RL reward for genuine reasoning; without the exact definition of this score, its scaling, or any execution/test-passing component, it is impossible to rule out that gains arise from lexical/template matching rather than causal understanding of the issues.
Authors: We appreciate the referee pointing this out. The current manuscript describes the reward only at a high level as a lightweight rule-based similarity score (e.g., between ground-truth and generated patches) without providing the exact formulation, scaling details, or explicit comparison to execution-based alternatives. We will revise the abstract and add a precise definition plus computation details in the Methods section, along with a discussion of its design choices and limitations (including the lack of test-passing verification). This will allow readers to better assess whether the gains reflect reasoning or other factors, while retaining the scalable nature of the approach.
revision: yes
-
Referee: [Method and Experiments] Method and Experiments sections: No ablation studies, reward variants, or comparisons to execution-based rewards are reported, nor are details provided on data curation, training dynamics, or statistical significance of the 41.0% result (e.g., variance across seeds or tests against baselines); these omissions make the out-of-domain generalization claim load-bearing yet unsupported.
Authors: We agree that the current version lacks these elements, which weakens the support for the generalization results. In the revised manuscript we will add ablation studies on reward components and variants, include comparisons to execution-based rewards where feasible, expand the data curation and training dynamics descriptions, and report statistical significance including variance across seeds for the main SWE-bench result and baseline comparisons.
revision: yes
-
Referee: [Abstract and Results] Abstract and Results: The claim that SWE-RL yields 'generalized reasoning skills' while SFT degrades performance requires explicit controls showing that the RL objective (rather than data volume or model scale) drives the difference; the current presentation leaves open the possibility that the observed pattern is an artifact of the particular training setup.
Authors: The SFT baseline was constructed using the identical data volume, model scale, and initialization as the RL run to isolate the objective. We will revise the abstract and Results section to state this control explicitly, add supporting analysis of training dynamics, and discuss why the observed pattern (RL improvement vs. SFT degradation) is attributable to the RL objective rather than other setup factors.
revision: yes
Circularity Check
No significant circularity in the derivation chain
The paper trains an RL model using a rule-based similarity reward computed directly against ground-truth patches drawn from open-source evolution data, then evaluates the resulting model on the independent, human-verified SWE-bench Verified benchmark and on separate out-of-domain tasks. No equations, fitted parameters, or self-citations are presented that reduce the claimed performance gains or generalization claims to the training reward by construction. The derivation therefore remains self-contained: standard RL is applied to external data, with results measured on held-out benchmarks rather than through self-referential definitions or renamed fits.
Axiom & Free-Parameter Ledger
free parameters (2)
- RL training hyperparameters
- Reward scaling coefficient
axioms (2)
- domain assumption: Similarity score between generated and ground-truth solutions is a valid proxy for reasoning quality
- domain assumption: Open-source software evolution data contains generalizable reasoning signals
Lean theorems connected to this paper
-
Cost.FunctionalEquation · washburn_uniqueness_aczel · contradicts?
CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer’s reasoning processes...
-
Foundation.LawOfExistence · defect_zero_iff_one · contradicts?
the reward is a similarity score (between 0 and 1) of the predicted and the oracle patch calculated by Python’s difflib.SequenceMatcher
-
Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear?
UNCLEAR: relation between the paper passage and the cited Recognition theorem.
Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.
-
RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.
-
Agentic Discovery of Exchange-Correlation Density Functionals
An agentic LLM system discovers the XC functional SAFS26-a that improves on the ωB97M-V baseline by roughly 9% on a held-out thermochemistry dataset while warning that such systems can exploit unphysical shortcuts.
-
Faithful Mobile GUI Agents with Guided Advantage Estimator
Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.
-
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Group-in-Group Policy Optimization for LLM Agent Training
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
-
Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning
REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.
-
CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora
CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.
-
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
-
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
-
ARuleCon: Agentic Security Rule Conversion
ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.
-
Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.
-
Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis
SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.
-
Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures
Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.
-
Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation
Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.
Reference graph
Works this paper leans on
-
[1]
Claude 3.5 sonnet model card addendum
Anthropic. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 2024 a
work page 2024
-
[2]
Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet
Anthropic. Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet . https://www.anthropic.com/research/swe-bench-sonnet, 2024 b
work page 2024
-
[4]
Codet: Code generation with generated tests
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=ktrw68Cmu9c
work page 2023
-
[5]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page 2021
-
[6]
Meta large language model compiler: Foundation models of compiler optimization, 2024
Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather. Meta large language model compiler: Foundation models of compiler optimization, 2024. https://arxiv.org/abs/2407.02524
-
[7]
DeepSeek-AI. Deepseek-v3 technical report, 2024. https://arxiv.org/abs/2412.19437
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao...
-
[10]
Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023
work page 2023
-
[11]
Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion
Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, ...
work page 2023
-
[13]
AI@Meta: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, and Angela Fan et al. The llama 3 herd of models, 2024. https://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Aider is ai pair programming in your terminal
Paul Gauthier. Aider is ai pair programming in your terminal. https://aider.chat/, 2024
work page 2024
-
[15]
Rlef: Grounding code llms in execution feedback with reinforcement learning
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025. https://arxiv.org/abs/2410.02089
-
[16]
GitHub. Github rest api documentation. https://docs.github.com/en/rest?apiVersion=2022-11-28, 2022. Accessed: 2025-02-24
work page 2022
-
[17]
GitHub. Github. https://github.com, 2025
work page 2025
-
[18]
Ilya Grigorik. Gh archive. https://www.gharchive.org/, 2025. Accessed: 2025-02-23
work page 2025
-
[19]
CRUXE val: A benchmark for code reasoning, understanding and execution
Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. CRUXE val: A benchmark for code reasoning, understanding and execution. In Proceedings of the 41st ICML, volume 235 of Proceedings of Machine Learning Research, pages 16568--16621. PMLR, 21--27 Jul 2024. https://proceedings.mlr.press/v235/gu24c.html
work page 2024
-
[20]
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence, 2024
work page 2024
-
[21]
Measuring coding challenge competence with apps, 2021 a
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps, 2021 a
work page 2021
-
[22]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021 b . https://openreview.net/forum?id=d7KBjmI3GmQ
work page 2021
-
[23]
Measuring mathematical problem solving with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021 c . https://openreview.net/forum?id=7Bywt2mQsCe
work page 2021
-
[24]
Qwen2.5-Coder Technical Report
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report, 2024. https://arxiv.org/abs/2...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Testgeneval: A real world unit test generation and test completion benchmark, 2024 a
Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark, 2024 a . https://arxiv.org/abs/2410.00752
-
[26]
Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024 b
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024 b
work page 2024
-
[27]
Impact of code language models on automated program repair, 2023
Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair, 2023
work page 2023
-
[28]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2023
work page 2023
-
[29]
Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench leaderboard. https://www.swebench.com, 2024. Accessed: 2025-02-04
work page 2024
-
[30]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. https://arxiv.org/abs/1412.6980
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[31]
Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models
Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 919--931. IEEE, 2023
work page 2023
-
[32]
Starcoder: may the source be with you!, 2023
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge...
work page 2023
-
[34]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chat GPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=1qvx610Cu7
work page 2023
-
[35]
Evaluating language models for efficient code generation
Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. Evaluating language models for efficient code generation. In First Conference on Language Modeling, 2024 a . https://openreview.net/forum?id=IBCBMeAhmC
work page 2024
-
[36]
Repobench: Benchmarking repository-level code auto-completion systems
Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, 2024 b . https://openreview.net/forum?id=pPjZIOuQuF
work page 2024
-
[37]
Starcoder 2 and the stack v2: The next generation, 2024
Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Z...
work page 2024
-
[38]
Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, and Yongbin Li. Lingma swe-gpt: An open development-process-centric language model for automated software improvement, 2024. https://arxiv.org/abs/2411.00622
-
[39]
OpenAI. Gpt-4o system card, 2024 a . https://arxiv.org/abs/2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
OpenAI. Openai o1 system card, 2024 b . https://arxiv.org/abs/2412.16720
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[41]
OpenAI . simple-evals. https://github.com/openai/simple-evals, 2024. Accessed: 2025-02-23
work page 2024
-
[42]
Introducing swe-bench verified
OpenAI. Introducing swe-bench verified. https://openai.com/index/introducing-swe-bench-verified, 2024. Accessed: 2025-02-04
work page 2024
-
[43]
Training software engineering agents and verifiers with swe-gym, 2024.URL https://arxiv
Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2024. https://arxiv.org/abs/2412.21139
-
[44]
John W. Ratcliff and David E. Metzener. Pattern Matching: The Gestalt Approach . Dr. Dobb's Journal, page 46, July 1988. https://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
-
[45]
Code llama: Open foundation models for code, 2023
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thoma...
work page 2023
-
[46]
An empirical evaluation of using large language models for automated unit test generation
Max Sch \"a fer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 2023
work page 2023
-
[47]
John Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html, 2020. Accessed: 2025-02-22
work page 2020
-
[48]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300
-
[49]
Sida I. Wang, Alex Gu, Lovish Madaan, Dieuwke Hupkes, Jiawei Liu, Yuxiang Wei, Naman Jain, Yuhang Lai, Sten Sootla, Ofir Press, Baptiste Rozière, and Gabriel Synnaeve. Eval-Arena: noise and errors on llm evaluations. https://github.com/crux-eval/eval-arena, 2024a
-
[50]
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...
-
[51]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=_V...
-
[52]
Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the copilots: Fusing large language models with completion engines for automated program repair, 2023
-
[53]
Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-Instruct. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 52632–52657. PMLR, 21–27 Jul 2024. https://proceedings.mlr.press/v235/wei24h.html
-
[54]
Chunqiu Steven Xia and Lingming Zhang. Less training, more repairing please: Revisiting automated program repair via zero-shot learning, 2022
-
[55]
Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Universal fuzzing via large language models, 2023a
-
[56]
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494, 2023b. doi:10.1109/ICSE48619.2023.00129
-
[57]
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. https://arxiv.org/abs/2407.01489
-
[58]
Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, and Kai Chen. Swe-fixer: Training open-source llms for effective and efficient github issue resolution, 2025
-
[59]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...
-
[60]
John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. https://openreview.net/forum?id=mXpq6ut8J3
-
[61]
John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024c. https://arxiv.org/abs/2410.03859
-
[62]
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. https://arxiv.org/abs/2502.03373
-
[64]
Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder rl via automated test-case synthesis. ArXiv, 2502.01718, 2025
-
[65]
Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation, 2023
-
[66]
Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement, 2024. https://arxiv.org/abs/2404.05427
-
[67]
Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, and Alexander M Rush. Commit0: Library generation from scratch, 2024. https://arxiv.org/abs/2412.01769
-
[69]
Albert Örwall. Moatless tools. https://github.com/aorwall/moatless-tools, 2024
-
[71]
WizardCoder: Empowering Code Large Language Models with Evol-Instruct, 2023
-
[72]
Teaching Large Language Models to Self-Debug, 2023
-
[73]
Foundations and Trends® in Programming Languages, 2017. doi:10.1561/2500000010
-
[74]
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is Your Code Generated by Chat. 2023
-
[75]
CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In The Eleventh International Conference on Learning Representations
-
[76]
Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. doi:10.18653/v1/2021.emnlp-main.685
-
[77]
Amazon Web Services
-
[78]
Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair, 2023
-
[79]
Sahil Chaudhary. GitHub repository, 2023
-
[80]
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. GitHub repository, 2023
-
[81]
Finetuned Language Models Are Zero-Shot Learners, 2022
-
[82]
WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244
-
[83]
WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568
-
[84]
PaLM: Scaling Language Modeling with Pathways, 2022
-
[85]
Training Compute-Optimal Large Language Models, 2022