SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

· 2025 · cs.SE · arXiv 2512.18470

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

open full Pith review browse 9 citing papers arXiv PDF

abstract

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.

citation-role summary

background 3

citation-polarity summary

background 2 support 1

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.

PBT-Bench: Benchmarking AI Agents on Property-Based Testing

cs.SE · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.

Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.

ProgramBench: Can Language Models Rebuild Programs From Scratch?

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.

Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.

The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents

cs.SE · 2026-05-04 · unverdicted · novelty 5.0

Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.

More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems

cs.SE · 2026-04-20 · unverdicted · novelty 5.0

AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

citing papers explorer

Showing 9 of 9 citing papers.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? cs.AI · 2026-05-07 · unverdicted · none · ref 67 · internal anchor
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades cs.SE · 2026-05-14 · unverdicted · none · ref 92 · internal anchor
SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.
PBT-Bench: Benchmarking AI Agents on Property-Based Testing cs.SE · 2026-05-13 · unverdicted · none · ref 14 · 2 links · internal anchor
PBT-Bench is a new benchmark of 100 property-based testing problems with 365 injected semantic bugs across 40 Python libraries that measures LLMs on deriving invariants and precise input-generation strategies.
Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution cs.SE · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
TEBench is a new project-level benchmark for test evolution showing coding agents achieve only 45-49% F1 on identifying tests needing changes, with stale tests hardest due to reliance on execution failures.
ProgramBench: Can Language Models Rebuild Programs From Scratch? cs.SE · 2026-05-05 · unverdicted · none · ref 14 · internal anchor
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems cs.AI · 2026-04-13 · unverdicted · none · ref 21 · internal anchor
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
The Conversations Beneath the Code: Triadic Data for Long-Horizon Software Engineering Agents cs.SE · 2026-05-04 · unverdicted · none · ref 39 · internal anchor
Triadic data—synchronized human-human conversations, human-AI sessions, and cross-functional team work—is the essential substrate for training long-horizon software engineering agents.
More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems cs.SE · 2026-04-20 · unverdicted · none · ref 39 · internal anchor
AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
AI for Auto-Research: Roadmap & User Guide cs.AI · 2026-05-18 · unverdicted · none · ref 202 · internal anchor
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer