SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

2026 · cs.SE · arXiv 2606.07682

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

AI agents are increasingly expected to complete long-horizon workflows that require sustained progress over hours, millions of tokens, and complex environments. Yet current agent benchmarks largely evaluate short-form tasks, such as single pull requests, small tickets, or 5-10 minute exercises, limiting our ability to measure agents' capabilities in planning, long-context understanding, and memory use. We introduce SWE-Marathon, a benchmark of 20 long-horizon tasks spanning software engineering and adjacent technical domains. Each task consists of a unique executable environment, a human-written reference solution, and a multi-layer verification suite. Logged agent attempts average 27.2M total tokens, making SWE-Marathon substantially longer-horizon than existing SWE and command-line agent benchmarks. Current frontier coding agents solve fewer than 30% of tasks. Failures often arise from poor self-verification, self-reported infeasibility, and premature termination. We also observe reward-hacking behavior in 13.8% of rollouts, where agents attempt to exploit the environment or verifier to bypass the intended workflow. SWE-Marathon includes adversarial review of test suites and execution environments, as well as multi-layer checks designed to prevent shortcut solutions. We release SWE-Marathon, evaluation code, and agent trajectories at https://swe-marathon.org/.

representative citing papers

SWE-INTERACT: Reimagining SWE Benchmarks as User-Driven Long-Horizon Coding Sessions

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

SWE-Interact shows frontier models solve roughly 25% of multi-turn interactive coding tasks versus 50% on single-turn baselines.

MirrorCode: AI can rebuild entire programs from behavior alone

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

MirrorCode benchmark shows current AI models achieving up to 56% success reimplementing 25 diverse full programs from behavior alone, including a 16,000-line bioinformatics toolkit.
