RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.
arXiv preprint arXiv:2506.18824 , year=
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 12representative citing papers
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
Coding agents reached 22-29% adoption in GitHub projects within months of release, with agent-assisted commits larger and focused on features and bug fixes.
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
TraceView organizes agentic APR trajectories into Thought-Action-Result components for semantic labeling and renders them as interactive graphs, with a user study showing improved scanability and understanding for five researchers.
Proposes Agentic Programming in which programs control execution flow and LLMs act as invoked components (LLM-as-Code) only for reasoning, producing DAG-structured contexts that improve stability in long-horizon computer-use agents.
Exploratory interview study with 17 developers identifies four forms of emergent oversight work for software agents and documents situated challenges and heuristics.
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
Ada is a scoped apparatus that records SWE-agent trajectories in real repositories and applies observation lenses to project navigation, evidence selection, synthesis, grounding, and stopping behaviors across 408 runs.
Case study of 12 LLM agent pairs on Fibonacci game development finds only DeepSeek-R1:DeepSeek-R1 converges correctly from the first iteration while others either diverge or fail to converge.
citing papers explorer
-
RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations
RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
-
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
-
Agentic Much? Adoption of Coding Agents on GitHub
Coding agents reached 22-29% adoption in GitHub projects within months of release, with agent-assisted commits larger and focused on features and bug fixes.
-
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
-
TraceView: Interactive Visualization of Agentic Program Repair Trajectories
TraceView organizes agentic APR trajectories into Thought-Action-Result components for semantic labeling and renders them as interactive graphs, with a user study showing improved scanability and understanding for five researchers.
-
LLM-as-Code: Agentic Programming for Agent Harness
Proposes Agentic Programming in which programs control execution flow and LLMs act as invoked components (LLM-as-Code) only for reasoning, producing DAG-structured contexts that improve stability in long-horizon computer-use agents.
-
Human oversight of agentic systems in practice: Examining the oversight work, challenges, and heuristics of developers using software agents
Exploratory interview study with 17 developers identifies four forms of emergent oversight work for software agents and documents situated challenges and heuristics.
-
FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search
FALAT improves failure attribution in LLM agent trajectories via dependency-guided search, achieving 46.0% step-level accuracy on algorithm-generated and 29.1% on hand-crafted trajectories in the Who&When benchmark.
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
Projecting the Emerging Mindset of SWE Agent by Launching a Wild Code Understanding Journey
Ada is a scoped apparatus that records SWE-agent trajectories in real repositories and applies observation lenses to project navigation, evidence selection, synthesis, grounding, and stopping behaviors across 408 runs.
-
Understanding Conversational Patterns in Multi-agent Programming: A Case Study on Fibonacci Game Development
Case study of 12 LLM agent pairs on Fibonacci game development finds only DeepSeek-R1:DeepSeek-R1 converges correctly from the first iteration while others either diverge or fail to converge.