Phoenix-bench shows that agentic AI systems lose 37-58% in resolved rate when moving from SWE-bench Verified to hardware tasks, because bugs spread across parallel modules via signal flow. Testbench feedback lifts performance by 42-45%, while file-level oracles add only 1.4%.
arXiv preprint arXiv:2503.21710
9 Pith papers cite this work.
Citation summary
Years: 2026 (9)
Roles: baseline (1)
Polarities: baseline (1)
Representative citing papers
ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.
AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
DUALVIEW is a dual-modal framework using Module Coupling, Function Call, Class Hierarchy, and Program Dependence graphs to enable persistent structural reasoning for agentic issue resolution, reporting gains on SWE-bench Pro and Verified.
PatchFusion uses deterministic atomic evidence fusion on candidate patches to outperform ranking, test-filtering, and LLM-judge selectors on SWE-bench and Defects4J pools.
RepoRescue creates a benchmark of 315 repositories and shows that LLM agents rescue up to 41.5% with runtime enforcement and 62.7% when combining systems, with the hardest cases requiring cross-file changes.
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
ContextSniper reduces token use by 38.9-51.5% in repository-level program repair agents on SWE-bench Lite, with drops of 2 percentage points in resolution rate.