The SWE-Bench Illusion.arXiv:2506.12286, June

· 2025 · arXiv 2506.12286

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

read on arXiv browse 7 citing papers

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

cs.SE · 2025-07-20 · conditional · novelty 8.0

AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.

Evaluating Plan Compliance in Autonomous Programming Agents

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.

Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?

cs.SE · 2026-02-20 · conditional · novelty 7.0

Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.

SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle

cs.SE · 2026-05-13 · unverdicted · novelty 6.0

SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.

Reproduction Test Generation for Java SWE Issues

cs.SE · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.

Diagnosing CFG Interpretation in LLMs

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.

Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation

cs.SE · 2026-04-06 · unverdicted · novelty 4.0

Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

citing papers explorer

Showing 7 of 7 citing papers.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering cs.SE · 2025-07-20 · conditional · none · ref 22
AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.
Evaluating Plan Compliance in Autonomous Programming Agents cs.SE · 2026-04-13 · unverdicted · none · ref 15
Autonomous programming agents frequently fail to follow instructed plans, falling back on incomplete internalized workflows, while standard plans and periodic reminders improve performance but poor plans can degrade it more than no plan.
Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs? cs.SE · 2026-02-20 · conditional · none · ref 5
Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle cs.SE · 2026-05-13 · unverdicted · none · ref 22
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
Reproduction Test Generation for Java SWE Issues cs.SE · 2026-05-05 · unverdicted · none · ref 21 · 2 links
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
Diagnosing CFG Interpretation in LLMs cs.AI · 2026-04-22 · unverdicted · none · ref 27
LLMs maintain surface syntax for novel CFGs but fail to preserve semantics under recursion and branching, relying on keyword bootstrapping rather than pure symbolic reasoning.
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation cs.SE · 2026-04-06 · unverdicted · none · ref 17
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.

The SWE-Bench Illusion.arXiv:2506.12286, June

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer