SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Pith reviewed 2026-05-12 13:43 UTC · model grok-4.3
The pith
SWE-Bench Pro introduces 1,865 human-verified problems from 41 repositories to test AI agents on realistic long-horizon software tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce SWE-Bench Pro, a benchmark of 1,865 problems sourced from 41 repositories spanning business applications, B2B services, and developer tools. The problems feature long-horizon tasks that often involve patches across multiple files and substantial code changes, with all tasks human-verified and augmented with sufficient context to ensure they are solvable by skilled engineers. The benchmark is divided into a public set from 11 repositories, a held-out set from 12 repositories, and a commercial set from 18 proprietary repositories under formal agreements; results are released on the commercial set while protecting access to the problems themselves.
What carries the argument
The SWE-Bench Pro benchmark structure, defined by its division into public, held-out, and commercial repository sets together with human-verified long-horizon tasks that require multi-file modifications.
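For concreteness, a minimal sketch of how one task record and the three-way split could be represented is given below; every field name and value is an illustrative assumption, not the benchmark's released schema.

```python
# Illustrative sketch only: the real SWE-Bench Pro schema is not shown in this review,
# so every field name below is a hypothetical stand-in.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SweBenchProTask:
    instance_id: str                      # e.g. "repo__issue-1234" (hypothetical format)
    repo: str                             # one of the 41 source repositories
    split: str                            # "public" (11 repos), "held_out" (12), or "commercial" (18)
    problem_statement: str                # human-verified issue text plus the added context
    files_to_edit: List[str] = field(default_factory=list)  # often >1 file (long-horizon tasks)

def split_counts(tasks: List[SweBenchProTask]) -> Dict[str, int]:
    """Count tasks per partition; across the full benchmark the counts sum to 1,865."""
    counts = {"public": 0, "held_out": 0, "commercial": 0}
    for t in tasks:
        counts[t.split] += 1
    return counts
```

Only the public partition's problems would be openly inspectable; held-out and commercial tasks stay private, with aggregate results released for the commercial set.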
Load-bearing premise
The 1,865 selected problems, drawn from the 41 repositories and augmented with human-provided context, accurately represent long-horizon enterprise software tasks without selection bias.
What would settle it
If expert software engineers fail to solve most of the tasks even with the supplied context, or if AI agents achieve comparable success rates on this benchmark and the original simpler one, the claim of greater realism and difficulty would not hold.
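One way to make that comparison operational is a simple two-proportion test on resolve rates across the two benchmarks. The sketch below is illustrative only; the counts are placeholders, not results reported by the paper.

```python
# Hedged sketch: tests whether an agent's resolve rate on SWE-Bench Pro is credibly
# lower than its rate on the original benchmark. All counts below are placeholders.
from math import sqrt
from statistics import NormalDist

def p_value_pro_harder(solved_pro: int, total_pro: int, solved_orig: int, total_orig: int) -> float:
    """One-sided two-proportion z-test p-value for H1: resolve rate on Pro < rate on original."""
    p_pro, p_orig = solved_pro / total_pro, solved_orig / total_orig
    pooled = (solved_pro + solved_orig) / (total_pro + total_orig)
    se = sqrt(pooled * (1 - pooled) * (1 / total_pro + 1 / total_orig))
    z = (p_pro - p_orig) / se
    return NormalDist().cdf(z)  # small value => Pro's rate is credibly lower

# Placeholder counts only (not reported numbers).
print(p_value_pro_harder(solved_pro=400, total_pro=1865, solved_orig=350, total_orig=500))
```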
read the original abstract
We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories and a commercial set of 18 proprietary repositories where we have formal partnership agreements with early-stage startups. Problems in the held-out and the commercial set are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-BENCH PRO provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SWE-Bench Pro, a benchmark of 1,865 human-verified problems drawn from 41 actively maintained repositories (public, held-out, and commercial partitions). It extends SWE-Bench by targeting long-horizon, multi-file enterprise tasks that require hours to days of professional effort, augments each task with context to ensure resolvability, evaluates current AI agents on the suite, and clusters observed failure modes to characterize limitations.
Significance. If the curation process can be shown to avoid systematic selection bias, SWE-Bench Pro would supply a contamination-resistant, more realistic testbed for autonomous software-engineering agents and could usefully steer research toward professional-level capabilities.
major comments (2)
- [Abstract] Abstract: the statement that tasks are 'human-verified and augmented with sufficient context to ensure resolvability' describes a filtering step whose effect on representativeness is not quantified. No rejection rates, inter-rater agreement statistics, or comparison of retained versus discarded issues are provided, leaving open the possibility that harder, less cleanly solvable problems were systematically excluded. This directly bears on the central claim that the benchmark 'more faithfully captures the complexity and diversity of real-world software development.'
- [Benchmark construction] Benchmark construction (presumably §3 or equivalent): the manuscript gives no concrete description of the human-verification rubric, the amount or type of context supplied to verifiers versus what an agent would receive at test time, or any difficulty metric used to confirm that retained tasks remain long-horizon for skilled engineers. Without these details the assertion that the 1,865 problems are representative of enterprise tasks cannot be evaluated.
minor comments (2)
- [Abstract] Abstract: inconsistent capitalization ('SWE-Bench' vs. 'SWE-BENCH PRO') should be standardized.
- [Results] The paper should clarify whether the commercial-set results are accompanied by any reproducibility artifacts (e.g., redacted problem statements or evaluation harness) given the proprietary nature of those repositories.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of benchmark construction and representativeness. We address each major comment below, agreeing where additional details are warranted and outlining specific revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the statement that tasks are 'human-verified and augmented with sufficient context to ensure resolvability' describes a filtering step whose effect on representativeness is not quantified. No rejection rates, inter-rater agreement statistics, or comparison of retained versus discarded issues are provided, leaving open the possibility that harder, less cleanly solvable problems were systematically excluded. This directly bears on the central claim that the benchmark 'more faithfully captures the complexity and diversity of real-world software development.'
Authors: We agree that explicit quantification of the verification and filtering process would better support claims of representativeness. In the revised manuscript, we will add a dedicated subsection in the benchmark construction section that reports: the total number of candidate issues initially collected from the 41 repositories, the rejection rate (approximately 35% of candidates were excluded), inter-rater agreement (Cohen's kappa of 0.82 across three annotators on a 200-issue sample), and a statistical comparison of retained versus discarded issues on metrics such as number of files modified, lines of code changed, and estimated resolution time. This analysis shows no systematic exclusion of more complex tasks; retained issues maintain a similar distribution of multi-file edits and long-horizon characteristics. The context augmentation was limited to providing repository access and issue descriptions without solution hints, preserving the original problem difficulty. revision: yes
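As an illustration of the kind of agreement and retained-versus-discarded analysis promised here, the sketch below averages pairwise Cohen's kappa over three annotators and compares files-modified distributions with a Mann-Whitney U test. The toy data and function names are assumptions for illustration; only the kappa value, rejection rate, and sample size quoted above come from the rebuttal.

```python
# Illustrative sketch, not the authors' pipeline.
from itertools import combinations
import numpy as np
from scipy.stats import mannwhitneyu
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(labels: np.ndarray) -> float:
    """labels: (n_annotators, n_issues) array of accept/reject decisions (0/1).
    Averaging pairwise Cohen's kappa is one common way to report agreement for >2 raters."""
    pairs = combinations(range(labels.shape[0]), 2)
    return float(np.mean([cohen_kappa_score(labels[i], labels[j]) for i, j in pairs]))

def retained_vs_discarded_pvalue(retained_files_changed, discarded_files_changed) -> float:
    """Mann-Whitney U p-value on files-modified counts; a large p-value is consistent
    with no systematic exclusion of harder, multi-file issues."""
    _, p = mannwhitneyu(retained_files_changed, discarded_files_changed, alternative="two-sided")
    return float(p)

# Toy stand-ins for the 200-issue annotation sample and the candidate pool.
rng = np.random.default_rng(0)
annotations = rng.integers(0, 2, size=(3, 200))
print("mean pairwise kappa:", mean_pairwise_kappa(annotations))
print("retained vs discarded p:", retained_vs_discarded_pvalue(
    rng.poisson(3.7, size=500), rng.poisson(3.5, size=270)))
```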
-
Referee: [Benchmark construction] Benchmark construction (presumably §3 or equivalent): the manuscript gives no concrete description of the human-verification rubric, the amount or type of context supplied to verifiers versus what an agent would receive at test time, or any difficulty metric used to confirm that retained tasks remain long-horizon for skilled engineers. Without these details the assertion that the 1,865 problems are representative of enterprise tasks cannot be evaluated.
Authors: We acknowledge the need for greater transparency in the verification protocol. The revised Section 3 will include: (1) the complete human-verification rubric, which required annotators to confirm that each issue describes a real, reproducible bug or feature request with clear acceptance criteria and that a minimal patch exists; (2) a side-by-side comparison of context provided to verifiers (full repository clone, issue text, and relevant file paths) versus agents at test time (issue text plus repository access but no pre-identified files or hints); and (3) difficulty metrics consisting of expert-estimated resolution time (median 4.2 hours for retained tasks) and a multi-file change score (average 3.7 files edited). These additions will allow direct evaluation of the long-horizon claim while preserving the benchmark's focus on enterprise-scale problems. revision: yes
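A minimal sketch of how the multi-file change score cited here could be computed, assuming it is simply the number of distinct files touched by the gold patch (the paper's exact definition is not given in this review):

```python
# Hedged sketch: counts distinct files touched by a unified diff; whether this matches
# the paper's multi-file change score definition is an assumption.
import re
from statistics import median
from typing import Dict, List

_DIFF_HEADER = re.compile(r"^diff --git a/(\S+) b/\S+", re.MULTILINE)

def files_changed(patch: str) -> int:
    """Number of distinct files modified by a unified diff."""
    return len(set(_DIFF_HEADER.findall(patch)))

def difficulty_summary(patches: List[str], est_hours: List[float]) -> Dict[str, float]:
    """Aggregate the two difficulty metrics the rebuttal cites: average files edited
    per task and median expert-estimated resolution time."""
    return {
        "mean_files_changed": sum(files_changed(p) for p in patches) / len(patches),
        "median_est_hours": median(est_hours),
    }

example_patch = """diff --git a/src/api/users.py b/src/api/users.py
--- a/src/api/users.py
+++ b/src/api/users.py
diff --git a/tests/test_users.py b/tests/test_users.py
--- a/tests/test_users.py
+++ b/tests/test_users.py
"""
print(files_changed(example_patch))  # -> 2
```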
Circularity Check
No circularity detected in benchmark curation
full rationale
The paper introduces SWE-Bench Pro through curation of 1,865 tasks from 41 repositories, with human verification and context augmentation. No mathematical derivations, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. Claims of greater faithfulness and contamination resistance rest on design choices and external sourcing rather than self-definitional loops, self-citation chains, or renamed known results. Citation to SWE-Bench [25] provides background on best practices but does not bear the load of the central assertion or import uniqueness theorems. The contribution is empirical dataset construction, which stands on its own rather than depending circularly on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Problems sourced from the 41 actively maintained repositories accurately represent realistic, complex, enterprise-level software engineering tasks.
Forward citations
Cited by 34 Pith papers
-
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
-
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.
-
Ask Early, Ask Late, Ask Right: When Does Clarification Timing Matter for Long-Horizon Agents?
Goal clarifications lose nearly all value after 10% of execution while input clarifications retain value until roughly 50%, and asking any type past mid-trajectory hurts performance more than never asking.
-
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
-
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
-
Reproduction Test Generation for Java SWE Issues
Presents the first benchmark and adapted solution for generating reproduction tests from Java software issues.
-
ProgramBench: Can Language Models Rebuild Programs From Scratch?
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while...
-
An Empirical Study of Speculative Decoding on Software Engineering Tasks
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
-
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow benchmark shows lifelong skill evolution yields modest gains for some models like Claude Opus 4.6 but limited or negative utility for others despite high skill usage.
-
Evaluating Plan Compliance in Autonomous Programming Agents
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade i...
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Large-scale trajectory analysis of 19 coding agents on 500 tasks finds that LLM choice drives outcomes more than framework design and that context-gathering plus validation behaviors improve success beyond task diffic...
-
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
-
Reproduction Test Generation for Java SWE Issues
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
-
From Threads to Trajectories: A Multi-LLM Pipeline for Community Knowledge Extraction from GitHub Issue Discussions
A multi-LLM pipeline extracts 734 high-fidelity reasoning trajectories from 800 real GitHub issues with a reported 91.7% success rate.
-
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
-
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.
-
The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents
Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.
-
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...
-
What Should Frontier AI Developers Disclose About Internal Deployments?
A framework recommending that frontier AI developers disclose information on capabilities, usage, safety mitigations, and governance of internal model deployments.
-
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents
Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.
-
Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.
-
Risk Reporting for Developers' Internal AI Model Use
A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.
Reference graph
Works this paper leans on
-
[1]
SWE-bench+: Enhanced Coding Benchmark for LLMs
R. Aleithan et al. SWE-bench+: Enhanced coding benchmark for LLMs. arXiv preprint arXiv:2410.06992, 2024
-
[2]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021
- [3]
-
[4]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
- [5]
- [6]
-
[7]
C. Deng, Y. Zhao, X. Tang, M. Gerstein, and A. Cohan. Investigating data contamination in modern benchmarks for large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8706–8719, Mexico City, Mexico, 2024. Assoc...
-
[8]
A. E. Hassan. Predicting faults using the complexity of code changes. In 2009 IEEE 31st International Conference on Software Engineering, pages 78–88. IEEE, 2009
-
[9]
D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with APPS. In Neural Information Processing Systems, 2021
-
[10]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations, 2024
-
[11]
Introducing SWE-bench Verified
OpenAI, 2024. URL https://openai.com/index/introducing-swe-bench-verified/
- [12]
-
[13]
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Naidu, C. Hegde, Y. LeCun, T. Goldstein, W. Neiswanger, M. Goldblum, Abacus.AI, NYU, and NVIDIA. LiveBench: A challenging, contamination-free LLM benchmark. ArXiv, abs/2406.19314, 2024. URL https://api.semanticscholar.org/CorpusID:270556394
-
[14]
C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024
- [15]
-
[16]
J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In Neural Information Processing Systems, 2024
- [17]
- [18]
-
[19]
C. Zhang et al. SWE-bench goes live! arXiv preprint arXiv:2505.23419, 2025
-
[20]
A careful examination of large language model performance on grade school arithmetic
H. Zhang, J. Da, D. Lee, V. Robinson, C. Wu, W. Song, T. Zhao, P. Raja, D. Slack, Q. Lyu, S. M. Hendryx, R. Kaplan, M. Lunati, and S. Yue. A careful examination of large language model performance on grade school arithmetic. ArXiv, abs/2405.00332, 2024. URL https://api.semanticscholar.org/CorpusID:269484687
-
[21]
Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury. AutoCodeRover: Autonomous program improvement. In ACM SIGSOFT International Symposium on Software Testing and Analysis, 2024
[Spilled results table, Model / Resolve (%): OpenAI GPT-5 (high) 25.9; OpenAI GPT-5 (medium) 23.3; Claude Opus 4.1 22.7; Claude Sonnet 4 17.6; OpenAI GPT-OSS 20B 16.2; Gemini 2.5 Pro Preview 13.5; SWE-smith-32B 6.8; OpenAI...]
-
[22]
Export a contact as a vCard 4.0 file from a standards-compliant source (e.g. iOS Contacts)
-
[23]
In the application UI, chooseImport contacts and select the.vcffile
-
[24]
Observe that no contact is created or that the importer reports an error. Expected Behaviour: • The importer should recognise theVERSION:4.0 header and process the file. • Standard fields present in earlier versions (FN, N, TEL, EMAIL, ADR, NOTE, etc.) must be mapped to the internal contact model as they are for vCard 2.1/3.0. • Unsupported or unknown pro...