{"paper":{"title":"SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Code agents show sharply lower success rates when handling complete issue resolution autonomously versus in isolated subtasks.","cross_cats":[],"primary_cat":"cs.SE","authors_text":"Hao Guan, Kangning Zhang, Lingyue Fu, Lin Qiu, Shao Zhang, Weinan Zhang, Weiwen Liu, Xuezhi Cao, Xunliang Cai, Yaoming Zhu, Yong Yu","submitted_at":"2026-05-13T08:05:16Z","abstract_excerpt":"As autonomous code agents move toward end-to-end software development, evaluating their practical autonomy becomes critical. Current benchmarks hide friction by testing agents in pre-configured environments, and their static evaluation pipelines frequently fail when parsing fully autonomous trajectories. We address these limitations with SWE-Cycle, a benchmark of 489 rigorously filtered instances. SWE-Cycle evaluates agents across three isolated tasks, including environment reconstruction, code implementation, and verification test generation, as well as an end-to-end FullCycle task that integ"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"The results reveal a sharp drop in solve rates when transitioning from isolated tasks to FullCycle execution, exposing critical bottlenecks in handling cross-phase dependencies and maintaining code quality.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 489 rigorously filtered instances and the SWE-Judge evaluation accurately capture practical autonomy without introducing selection bias or verification errors that would change the observed performance drop.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Code agents show sharply lower success rates when handling complete issue resolution autonomously versus in isolated subtasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c5e3a9f10502354c2d86f7c689b5952b727394f972ea95112b069d02d33983bb"},"source":{"id":"2605.13139","kind":"arxiv","version":1},"verdict":{"id":"a718588a-517f-4647-8103-fb1943071fd3","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T18:32:49.817448Z","strongest_claim":"The results reveal a sharp drop in solve rates when transitioning from isolated tasks to FullCycle execution, exposing critical bottlenecks in handling cross-phase dependencies and maintaining code quality.","one_line_summary":"SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 489 rigorously filtered instances and the SWE-Judge evaluation accurately capture practical autonomy without introducing selection bias or verification errors that would change the observed performance drop.","pith_extraction_headline":"Code agents show sharply lower success rates when handling complete issue resolution autonomously versus in isolated subtasks."},"references":{"count":66,"sample":[{"doi":"","year":null,"title":"Claude 4.6 sonnet system card","work_id":"85bdec73-f606-453f-ac9a-34ba951a6b05","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"URL https://assets.anthropic.com/m/785e231869ea8b3b/original/ Claude-4-6-Sonnet-System-Card.pdf","work_id":"9dc8b2b7-21b7-4e27-ac6c-d6ac7d3ec38f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Anthropic. Claude code, 2025. URLhttps://github.com/anthropics/claude-code","work_id":"a974e0ed-b8d8-4534-a3aa-34731e913fbf","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Introducing claude opus 4.5","work_id":"135f65e0-92b8-43bb-8f8e-a7cd633b7005","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Why Do Multi-Agent LLM Systems Fail?","work_id":"b186294a-cda7-4df0-9a28-27d379af92b2","ref_index":5,"cited_arxiv_id":"2503.13657","is_internal_anchor":true}],"resolved_work":66,"snapshot_sha256":"8b4734778106998e880b2fe3f427d0e9dea9e4ede5dd060260ff25b3bc86b55c","internal_anchors":12},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}