BeyondSWE: A comprehensive benchmark for evaluating code agents beyond narrow bug fixing

2026 · cs.CL · DOI 10.48550/arxiv.2603.03194 · arXiv 2603.03194

5 Pith papers cite this work. Polarity classification is still indexing.


abstract

Current code-agent benchmarks primarily evaluate localized issue resolution within a single target repository, leaving many software engineering tasks that require external knowledge or broader repository-level changes under-tested. We introduce BeyondSWE, a 500-instance benchmark drawn from 246 real-world GitHub repositories to evaluate code agents beyond single-repository bug fixing. BeyondSWE covers four representative settings: cross-repository issue resolution, domain-specific issue resolution, dependency-driven migration, and document-to-repository generation, spanning both broader knowledge scope and broader resolution scope. Our evaluation shows that BeyondSWE remains far from saturated: the best OpenHands-based agent reaches an average score of 46.12, while the strongest Codex harness with GPT-5.4 (xhigh) reaches 56.65 under a search-aware prompt. To study whether external information access closes this gap, we use SearchSWE as a controlled diagnostic baseline for search-augmented coding. Search access improves most models and substantially helps on some tasks, but the gains remain limited and uneven, showing that current agents still struggle to convert retrieved information into precise, version-compatible, and locally actionable code changes. These results suggest that deep search for coding remains an open problem: progress requires agents that can reliably combine external evidence with repository-local reasoning and execution-based verification.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

SWE-Explore is a new benchmark evaluating repository exploration by coding agents on 848 issues across 203 repositories, using line-level ground truth from successful agent trajectories and showing agentic methods outperform classical retrieval on coverage and ranking.

From Execution to Education: A Bloom-Aligned Framework for Measuring Educational Control in LLMs

cs.CL · 2026-07-09 · conditional · novelty 6.5

On 2,520 programming tasks, matched Qwen general and coder models reliably raise Bloom cognitive demand but fail to lower it, so execution skill does not imply educational control.

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

cs.MA · 2026-05-28 · unverdicted · novelty 6.0

Meta-Team is a collaborative self-evolution framework that turns multi-agent execution experience into reusable improvements at agent, coordination, and team levels, outperforming baselines on six benchmarks.

SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks

cs.NI · 2026-04-29 · unverdicted · novelty 6.0

SWE-Bench 5G is the first benchmark for AI agents fixing bugs in 5G core network software, showing high diagnosis rates but low resolution that improves conditionally with specification context.

REAgent: Requirement-Driven LLM Agents for Software Issue Resolution

cs.SE · 2026-04-08 · unverdicted · novelty 6.0

REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.

citing papers explorer

Showing 5 of 5 citing papers.

SWE-Explore: Benchmarking How Coding Agents Explore Repositories
cs.SE · 2026-06-05 · unverdicted · none · ref 3 · internal anchor
From Execution to Education: A Bloom-Aligned Framework for Measuring Educational Control in LLMs
cs.CL · 2026-07-09 · conditional · none · ref 28 · internal anchor
Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems
cs.MA · 2026-05-28 · unverdicted · none · ref 11 · internal anchor
SWE-Bench 5G: Benchmarking AI Coding Agents on Telecom Network Engineering Tasks
cs.NI · 2026-04-29 · unverdicted · none · ref 17 · internal anchor
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
cs.SE · 2026-04-08 · unverdicted · none · ref 9 · internal anchor
