{"paper":{"title":"SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Coding agents resolve an average of 44.8 percent of chained release-level package upgrades while preserving prior functionality.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.SE","authors_text":"Chaozheng Wang, Haau-sing Li, Hange Liu, Jen-tse Huang, Jingyu Xiao, Man Ho Lam, Michael R. Lyu, Terry Yue Zhuo","submitted_at":"2026-05-14T06:04:40Z","abstract_excerpt":"Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent's prior codebase. To produce upgrade specifications, we design a divide-a"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 leading at 60.8% resolving, and current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The divide-and-conquer synthesis pipeline produces upgrade specifications that are both grounded in actual code changes and feasible for agents to implement without introducing artificial simplifications that do not occur in real maintenance.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Coding agents resolve an average of 44.8 percent of chained release-level package upgrades while preserving prior functionality.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"bb2e5bfacec09bba1aa6e2ddfb8cd6a94368fd25f7c771c4659b950422ab6a15"},"source":{"id":"2605.14415","kind":"arxiv","version":1},"verdict":{"id":"0e1c31f3-05b2-4e67-ac63-f6944555f54a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:24:36.713507Z","strongest_claim":"Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 leading at 60.8% resolving, and current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.","one_line_summary":"SWE-Chain provides 155 chained version transitions and 1,660 requirements across 9 Python packages, where frontier agents resolve 44.8% of tasks on average and struggle to preserve functionality across releases.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The divide-and-conquer synthesis pipeline produces upgrade specifications that are both grounded in actual code changes and feasible for agents to implement without introducing artificial simplifications that do not occur in real maintenance.","pith_extraction_headline":"Coding agents resolve an average of 44.8 percent of chained release-level package upgrades while preserving prior functionality."},"references":{"count":95,"sample":[{"doi":"","year":null,"title":"Claude Code overview , author=","work_id":"67d3f9d1-c866-44d6-90e4-f7f1dc00a99a","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Introducing Codex , author=. 2025 , url=","work_id":"ac6a07b6-623e-4a61-8481-0086ef9f711a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"OpenCode: The open source AI coding agent , author=. 2025 , url=","work_id":"ee36e7fc-a03f-4bf2-9c39-eb5e8db45499","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Xu and Xiangru Tang and Mingchen Zhuge and Jiayi Pan and Yueqi Song and Bowen Li and Jaskirat Singh and Hoang H","work_id":"a9881734-84a6-4af0-8ee1-8d46cbf56057","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"John Yang and Carlos E Jimenez and Alexander Wettig and Kilian Lieret and Shunyu Yao and Karthik R Narasimhan and Ofir Press , booktitle=","work_id":"10938d0d-9204-4623-ba08-4545471bbdcf","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":95,"snapshot_sha256":"7e2d8c4c57418a6f7291bea0e79d3723a9dfd54b29853fb2f56ea9ab9e8959f0","internal_anchors":6},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}