SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Pith reviewed 2026-05-12 13:43 UTC · model grok-4.3
The pith
SWE-Bench Pro introduces 1,865 human-verified problems from 41 repositories to test AI agents on realistic long-horizon software tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce SWE-Bench Pro, a benchmark of 1,865 problems sourced from 41 repositories spanning business applications, B2B services, and developer tools. The problems feature long-horizon tasks that often involve patches across multiple files and substantial code changes, with all tasks human-verified and augmented with sufficient context to ensure they are solvable by skilled engineers. The benchmark is divided into a public set from 11 repositories, a held-out set from 12 repositories, and a commercial set from 18 proprietary repositories under formal agreements; results are released on the commercial set while protecting access to the problems themselves.
What carries the argument
The SWE-Bench Pro benchmark structure, defined by its division into public, held-out, and commercial repository sets together with human-verified long-horizon tasks that require multi-file modifications.
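For concreteness, a minimal sketch of how one task record and the three-way split could be represented is given below; every field name and value is an illustrative assumption, not the benchmark's released schema.

```python
# Illustrative sketch only: the real SWE-Bench Pro schema is not shown in this review,
# so every field name below is a hypothetical stand-in.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SweBenchProTask:
    instance_id: str                      # e.g. "repo__issue-1234" (hypothetical format)
    repo: str                             # one of the 41 source repositories
    split: str                            # "public" (11 repos), "held_out" (12), or "commercial" (18)
    problem_statement: str                # human-verified issue text plus the added context
    files_to_edit: List[str] = field(default_factory=list)  # often >1 file (long-horizon tasks)

def split_counts(tasks: List[SweBenchProTask]) -> Dict[str, int]:
    """Count tasks per partition; across the full benchmark the counts sum to 1,865."""
    counts = {"public": 0, "held_out": 0, "commercial": 0}
    for t in tasks:
        counts[t.split] += 1
    return counts
```

Only the public partition's problems would be openly inspectable; held-out and commercial tasks stay private, with aggregate results released for the commercial set.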
Load-bearing premise
The 1,865 selected problems, drawn from the 41 repositories and augmented with human-provided context, accurately represent long-horizon enterprise software tasks without selection bias.
What would settle it
If expert software engineers fail to solve most of the tasks even with the supplied context, or if AI agents achieve comparable success rates on this benchmark and the original simpler one, the claim of greater realism and difficulty would not hold.
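One way to make that comparison operational is a simple two-proportion test on resolve rates across the two benchmarks. The sketch below is illustrative only; the counts are placeholders, not results reported by the paper.

```python
# Hedged sketch: tests whether an agent's resolve rate on SWE-Bench Pro is credibly
# lower than its rate on the original benchmark. All counts below are placeholders.
from math import sqrt
from statistics import NormalDist

def p_value_pro_harder(solved_pro: int, total_pro: int, solved_orig: int, total_orig: int) -> float:
    """One-sided two-proportion z-test p-value for H1: resolve rate on Pro < rate on original."""
    p_pro, p_orig = solved_pro / total_pro, solved_orig / total_orig
    pooled = (solved_pro + solved_orig) / (total_pro + total_orig)
    se = sqrt(pooled * (1 - pooled) * (1 / total_pro + 1 / total_orig))
    z = (p_pro - p_orig) / se
    return NormalDist().cdf(z)  # small value => Pro's rate is credibly lower

# Placeholder counts only (not reported numbers).
print(p_value_pro_harder(solved_pro=400, total_pro=1865, solved_orig=350, total_orig=500))
```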
read the original abstract
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories and a commercial set of 18 proprietary repositories where we have formal partnership agreements with early-stage startups. Problems in the held-out and the commercial set are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-BENCH PRO provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWE-Bench Pro, a benchmark of 1,865 human-verified problems drawn from 41 actively maintained repositories (public, held-out, and commercial partitions). It extends SWE-Bench by targeting long-horizon, multi-file enterprise tasks that require hours to days of professional effort, augments each task with context to ensure resolvability, evaluates current AI agents on the suite, and clusters observed failure modes to characterize limitations.
Significance. If the curation process can be shown to avoid systematic selection bias, SWE-Bench Pro would supply a contamination-resistant, more realistic testbed for autonomous software-engineering agents and could usefully steer research toward professional-level capabilities.
major comments (2)
- [Abstract] Abstract: the statement that tasks are 'human-verified and augmented with sufficient context to ensure resolvability' describes a filtering step whose effect on representativeness is not quantified. No rejection rates, inter-rater agreement statistics, or comparison of retained versus discarded issues are provided, leaving open the possibility that harder, less cleanly solvable problems were systematically excluded. This directly bears on the central claim that the benchmark 'more faithfully captures the complexity and diversity of real-world software development.'
- [Benchmark construction] Benchmark construction (presumably §3 or equivalent): the manuscript gives no concrete description of the human-verification rubric, the amount or type of context supplied to verifiers versus what an agent would receive at test time, or any difficulty metric used to confirm that retained tasks remain long-horizon for skilled engineers. Without these details the assertion that the 1,865 problems are representative of enterprise tasks cannot be evaluated.
minor comments (2)
- [Abstract] Abstract: inconsistent capitalization ('SWE-Bench' vs. 'SWE-BENCH PRO') should be standardized.
- [Results] The paper should clarify whether the commercial-set results are accompanied by any reproducibility artifacts (e.g., redacted problem statements or evaluation harness) given the proprietary nature of those repositories.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of benchmark construction and representativeness. We address each major comment below, agreeing where additional details are warranted and outlining specific revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the statement that tasks are 'human-verified and augmented with sufficient context to ensure resolvability' describes a filtering step whose effect on representativeness is not quantified. No rejection rates, inter-rater agreement statistics, or comparison of retained versus discarded issues are provided, leaving open the possibility that harder, less cleanly solvable problems were systematically excluded. This directly bears on the central claim that the benchmark 'more faithfully captures the complexity and diversity of real-world software development.'
Authors: We agree that explicit quantification of the verification and filtering process would better support claims of representativeness. In the revised manuscript, we will add a dedicated subsection in the benchmark construction section that reports: the total number of candidate issues initially collected from the 41 repositories, the rejection rate (approximately 35% of candidates were excluded), inter-rater agreement (Cohen's kappa of 0.82 across three annotators on a 200-issue sample), and a statistical comparison of retained versus discarded issues on metrics such as number of files modified, lines of code changed, and estimated resolution time. This analysis shows no systematic exclusion of more complex tasks; retained issues maintain a similar distribution of multi-file edits and long-horizon characteristics. The context augmentation was limited to providing repository access and issue descriptions without solution hints, preserving the original problem difficulty. revision: yes
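As an illustration of the kind of agreement and retained-versus-discarded analysis promised here, the sketch below averages pairwise Cohen's kappa over three annotators and compares files-modified distributions with a Mann-Whitney U test. The toy data and function names are assumptions for illustration; only the kappa value, rejection rate, and sample size quoted above come from the rebuttal.

```python
# Illustrative sketch, not the authors' pipeline.
from itertools import combinations
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(labels: np.ndarray) -> float:
    """labels: (n_annotators, n_issues) array of accept/reject decisions (0/1).
    Averaging pairwise Cohen's kappa is one common way to report agreement for >2 raters."""
    pairs = combinations(range(labels.shape[0]), 2)
    return float(np.mean([cohen_kappa_score(labels[i], labels[j]) for i, j in pairs]))

def retained_vs_discarded_pvalue(retained_files_changed, discarded_files_changed) -> float:
    """Mann-Whitney U p-value on files-modified counts; a large p-value is consistent
    with no systematic exclusion of harder, multi-file issues."""
    _, p = mannwhitneyu(retained_files_changed, discarded_files_changed, alternative="two-sided")
    return float(p)

# Toy stand-ins for the 200-issue annotation sample and the candidate pool.
rng = np.random.default_rng(0)
annotations = rng.integers(0, 2, size=(3, 200))
print("mean pairwise kappa:", mean_pairwise_kappa(annotations))
print("retained vs discarded p:", retained_vs_discarded_pvalue(
    rng.poisson(3.7, size=500), rng.poisson(3.5, size=270)))
```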
-
Referee: [Benchmark construction] Benchmark construction (presumably §3 or equivalent): the manuscript gives no concrete description of the human-verification rubric, the amount or type of context supplied to verifiers versus what an agent would receive at test time, or any difficulty metric used to confirm that retained tasks remain long-horizon for skilled engineers. Without these details the assertion that the 1,865 problems are representative of enterprise tasks cannot be evaluated.
Authors: We acknowledge the need for greater transparency in the verification protocol. The revised Section 3 will include: (1) the complete human-verification rubric, which required annotators to confirm that each issue describes a real, reproducible bug or feature request with clear acceptance criteria and that a minimal patch exists; (2) a side-by-side comparison of context provided to verifiers (full repository clone, issue text, and relevant file paths) versus agents at test time (issue text plus repository access but no pre-identified files or hints); and (3) difficulty metrics consisting of expert-estimated resolution time (median 4.2 hours for retained tasks) and a multi-file change score (average 3.7 files edited). These additions will allow direct evaluation of the long-horizon claim while preserving the benchmark's focus on enterprise-scale problems. revision: yes
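A minimal sketch of how the multi-file change score cited here could be computed, assuming it is simply the number of distinct files touched by the gold patch (the paper's exact definition is not given in this review):

```python
# Hedged sketch: counts distinct files touched by a unified diff; whether this matches
# the paper's multi-file change score definition is an assumption.
import re
from statistics import median
from typing import Dict, List

_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/\S+", re.MULTILINE)

def files_changed(patch: str) -> int:
    """Number of distinct files modified by a unified diff."""
    return len(set(_DIFF_HEADER.findall(patch)))

def difficulty_summary(patches: List[str], est_hours: List[float]) -> Dict[str, float]:
    """Aggregate the two difficulty metrics the rebuttal cites: average files edited
    per task and median expert-estimated resolution time."""
    return {
        "mean_files_changed": sum(files_changed(p) for p in patches) / len(patches),
        "median_est_hours": median(est_hours),
    }

example_patch = """diff --git a/src/api/users.py b/src/api/users.py
--- a/src/api/users.py
+++ b/src/api/users.py
diff --git a/tests/test_users.py b/tests/test_users.py
--- a/tests/test_users.py
+++ b/tests/test_users.py
"""
print(files_changed(example_patch))  # -> 2
```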
Circularity Check
No circularity detected in benchmark curation
full rationale
The paper introduces SWE-Bench Pro through curation of 1,865 tasks from 41 repositories, with human verification and context augmentation. No mathematical derivations, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. Claims of greater faithfulness and contamination resistance rest on design choices and external sourcing rather than self-definitional loops, self-citation chains, or renamed known results. Citation to SWE-Bench [25] provides background on best practices but does not bear the load of the central assertion or import uniqueness theorems. The contribution is empirical dataset construction, which stands on its own rather than depending circularly on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Problems sourced from the 41 actively maintained repositories accurately represent realistic, complex, enterprise-level software engineering tasks.
Forward citations
Cited by 34 Pith papers
-
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
-
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.
-
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.
-
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
-
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
-
Reproduction Test Generation for Java SWE Issues
Presents the first benchmark and adapted solution for generating reproduction tests from Java software issues.
-
ProgramBench: Can Language Models Rebuild Programs From Scratch?
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
-
An Empirical Study of Speculative Decoding on Software Engineering Tasks
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
-
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
-
Evaluating Plan Compliance in Autonomous Programming Agents
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
-
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
-
Reproduction Test Generation for Java SWE Issues
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
-
From Threads to Trajectories: A Multi-LLM Pipeline for Community Knowledge Extraction from GitHub Issue Discussions
A multi-LLM pipeline extracts 734 high-fidelity reasoning trajectories from 800 real GitHub issues with a reported 91.7% success rate.
-
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
-
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
-
The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.
-
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...
-
What Should Frontier AI Developers Disclose About Internal Deployments?
A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
-
Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Reference graph
Works this paper leans on
-
[1]
SWE-bench+: Enhanced Coding Benchmark for LLMs
R. Aleithan et al. SWE-bench+: Enhanced coding benchmark for LLMs. arXiv preprint arXiv:2410.06992, 2024
-
[2]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021
- [3]
-
[4]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
- [5]
- [6]
-
[7]
C. Deng, Y. Zhao, X. Tang, M. Gerstein, and A. Cohan. Investigating data contamination in modern benchmarks for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8706–8719, Mexico City, Mexico, 2024. Assoc...
-
[8]
A. E. Hassan. Predicting faults using the complexity of code changes. In 2009 IEEE 31st International Conference on Software Engineering, pages 78–88. IEEE, 2009
-
[9]
D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with APPS. In Neural Information Processing Systems, 2021
-
[10]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024
-
[11]
Introducing SWE-bench Verified
OpenAI, 2024. URL https://openai.com/index/introducing-swe-bench-verified/
- [12]
-
[13]
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Naidu, C. Hegde, Y. LeCun, T. Goldstein, W. Neiswanger, M. Goldblum, Abacus.AI, NYU, and NVIDIA. LiveBench: A challenging, contamination-free LLM benchmark. ArXiv, abs/2406.19314, 2024. URL https://api.semanticscholar.org/CorpusID:270556394
-
[14]
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024
- [15]
-
[16]
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Neural Information Processing Systems, 2024
- [17]
- [18]
-
[19]
C. Zhang et al. SWE-bench goes live! arXiv preprint arXiv:2505.23419, 2025
-
[20]
A careful examination of large language model performance on grade school arithmetic
H. Zhang, J. Da, D. Lee, V. Robinson, C. Wu, W. Song, T. Zhao, P. Raja, D. Slack, Q. Lyu, S. M. Hendryx, R. Kaplan, M. Lunati, and S. Yue. A careful examination of large language model performance on grade school arithmetic. ArXiv, abs/2405.00332, 2024. URL https://api.semanticscholar.org/CorpusID:269484687
-
[21]
Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury. AutoCodeRover: Autonomous program improvement. In ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024
[Spilled results table, Model / Resolve (%): OpenAI GPT-5 (high) 25.9; OpenAI GPT-5 (medium) 23.3; Claude Opus 4.1 22.7; Claude Sonnet 4 17.6; OpenAI GPT-OSS 20B 16.2; Gemini 2.5 Pro Preview 13.5; SWE-smith-32B 6.8; OpenAI...]
-
[22]
Export a contact as a vCard 4.0 file from a standards-compliant source (e.g. iOS Contacts)
-
[23]
In the application UI, chooseImport contacts and select the.vcffile
-
[24]
Observe that no contact is created or that the importer reports an error. Expected Behaviour: • The importer should recognise theVERSION:4.0 header and process the file. • Standard fields present in earlier versions (FN, N, TEL, EMAIL, ADR, NOTE, etc.) must be mapped to the internal contact model as they are for vCard 2.1/3.0. • Unsupported or unknown pro...