arXiv preprint arXiv:2503.21710
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years: 2026 (9)
roles: baseline (1)
polarities: baseline (1)
representative citing papers
AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
citing papers explorer
- Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench
  Phoenix-bench shows that agentic AI systems lose 37-58% in resolved rate when moving from SWE-bench Verified to hardware tasks, because bugs spread across parallel modules via signal flow; testbench feedback lifts performance by 42-45%, while file-level oracles add only 1.4%.
- ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair
  ARISE adds a data-flow-augmented repository graph and a three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.
- Beyond Textual Repository Exploration: Dual-Modal Structural Reasoning for Agentic Issue Resolution
  DUALVIEW is a dual-modal framework that uses Module Coupling, Function Call, Class Hierarchy, and Program Dependence graphs to enable persistent structural reasoning for agentic issue resolution, reporting gains on SWE-bench Pro and Verified.
- A Single Patch Is Not Enough: Deterministic Fusion of Repair Candidates
  PatchFusion applies deterministic atomic evidence fusion to candidate patches and outperforms ranking, test-filtering, and LLM-judge selectors on SWE-bench and Defects4J pools.
- RepoRescue: An Empirical Study of LLM Agents on Whole-Repository Compatibility Rescue
  RepoRescue creates a benchmark of 315 repositories and shows that LLM agents rescue up to 41.5% with runtime enforcement and 62.7% when combining systems, with the hardest cases requiring cross-file changes.
- Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints
  Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
- ContextSniper: AntTrail's Token-Efficient Code Memory for Repository-Level Program Repair
  ContextSniper reduces token use by 38.9-51.5% in repository-level program repair agents on SWE-bench Lite, with resolution-rate drops of 2 percentage points.