BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

Chang Liu; Cheng Chen; Daixuan Cheng; Fanzhe Meng; Guoxin Chen; Huatong Song; Hui Chen; Jiale Zhao; Jie Chen; Ji-Rong Wen

arxiv: 2603.03194 · v2 · pith:DMEEB2GTnew · submitted 2026-03-03 · 💻 cs.CL · cs.SE

Guoxin Chen, Fanzhe Meng, Jiale Zhao, Minghao Li, Daixuan Cheng, Huatong Song, Jie Chen, Yuzhi Lin, Hui Chen, Xin Zhao, Ruihua Song, Chang Liu, Cheng Chen, Kai Jia, Ji-Rong Wen

keywords beyondswe resolution agents broader code current external issue

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving under-tested many software engineering tasks that require external knowledge or broader repository-level changes. We introduce BeyondSWE, a 500-instance benchmark drawn from 246 real-world GitHub repositories to evaluate code agents beyond single-repository bug fixing. BeyondSWE covers four representative settings: cross-repository issue resolution, domain-specific issue resolution, dependency-driven migration, and document-to-repository generation, spanning both broader knowledge scope and broader resolution scope. Our evaluation shows that BeyondSWE remains far from saturated: the best OpenHands-based agent reaches 46.12 average score, while the strongest Codex harness with GPT-5.4 (xhigh) reaches 56.65 under a search-aware prompt. To study whether external information access closes this gap, we use SearchSWE as a controlled diagnostic baseline for search-augmented coding. Search access improves most models and substantially helps some tasks, but the gains remain limited and uneven, showing that current agents still struggle to convert retrieved information into precise, version-compatible, and locally actionable code changes. These results suggest that deep search for coding remains an open problem: progress requires agents that can reliably combine external evidence with repository-local reasoning and execution-based verification.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks
cs.NI 2026-04 unverdicted novelty 6.0

SWE-Bench 5G is the first benchmark for AI agents fixing bugs in 5G core network software, showing high diagnosis rates but low resolution that improves conditionally with specification context.

REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
cs.SE 2026-04 unverdicted novelty 6.0

REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.