
Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang · 2025 · Proceedings of the ACM on Software Engineering · DOI 10.1145/3715754

Canonical reference. 82% of citing Pith papers cite this work as background.

27 Pith papers citing it

38 external citations · Crossref

Background 82% of classified citations


citation-role summary

background 9 · baseline 1 · method 1

citation-polarity summary

background 9 · baseline 1 · use method 1

representative citing papers

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

cs.SE · 2026-05-07 · conditional · novelty 8.0

LLMs frequently specify library versions with known CVEs in generated code (36-56% of tasks), show low compatibility (20-63%), and converge on the same risky versions across models.

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

cs.SE · 2026-05-18 · conditional · novelty 7.0

The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.

Debug Like a Human: Scaling LLM-based Fault Localization to Processor Design via Block-Level Instruction-Oriented Slicing

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

BluesFL uses block-level instruction-oriented slicing with LLMs to localize 24 bugs at Top-1 in a 19K-line RISC-V processor, a 242.9% gain over prior SOTA of 7 bugs.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis

cs.CR · 2026-05-15 · unverdicted · novelty 7.0

Neo combines LLM-based agents with code search primitives to detect privilege escalation in polyglot microservices, reporting 81% precision and 85% recall while uncovering 24 zero-day vulnerabilities across 25 applications.

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

cs.SE · 2026-05-09 · unverdicted · novelty 7.0

PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

cs.SE · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

cs.SE · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

TACT identifies drift axes in residual stream activations separating overthinking, overacting, and calibrated steps, then steers test-time activations toward the calibrated region to raise resolve rates by 4.8-5.8 pp and cut steps by up to 26% on coding benchmarks.

Deep Graph-Language Fusion for Structure-Aware Code Generation

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.

AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

cs.SE · 2026-04-03 · conditional · novelty 7.0

AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

cs.SE · 2026-01-21 · unverdicted · novelty 7.0

A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.

Investigating Test Overfitting on SWE-bench

cs.SE · 2025-11-20 · unverdicted · novelty 7.0

The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

cs.LG · 2025-11-11 · unverdicted · novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing

cs.SE · 2026-05-14 · conditional · novelty 6.0

FuzzAgent deploys specialized agents that collaborate on harness generation, execution, and crash triage to evolve fuzzing campaigns, delivering 45-191% more branch coverage than four baselines on 20 C/C++ libraries and surfacing 102 real bugs.

EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair

cs.SE · 2026-05-09 · unverdicted · novelty 6.0

EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.

Reproduction Test Generation for Java SWE Issues

cs.SE · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.

SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents

cs.SE · 2026-04-20 · unverdicted · novelty 6.0

SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.

CodeTracer: Towards Traceable Agent States

cs.SE · 2026-04-13 · unverdicted · novelty 6.0

CodeTracer reconstructs hierarchical state-transition trees from heterogeneous code-agent run artifacts and localizes failure onsets, outperforming baselines on the new CodeTraceBench dataset while enabling recovery of failed trajectories.

From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents

cs.SE · 2026-04-02 · unverdicted · novelty 6.0

A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multilingual version.

Can Old Tests Do New Tricks for Resolving SWE Issues?

cs.SE · 2025-10-21 · conditional · novelty 6.0

TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.

Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths

cs.SE · 2025-10-01 · conditional · novelty 6.0

PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.

citing papers explorer


Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions cs.SE · 2026-05-07 · conditional · none · ref 74
LLMs frequently specify library versions with known CVEs in generated code (36-56% of tasks), show low compatibility (20-63%), and converge on the same risky versions across models.
Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents cs.SE · 2026-05-18 · conditional · none · ref 25
The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.
Debug Like a Human: Scaling LLM-based Fault Localization to Processor Design via Block-Level Instruction-Oriented Slicing cs.SE · 2026-05-17 · unverdicted · none · ref 26
BluesFL uses block-level instruction-oriented slicing with LLMs to localize 24 bugs at Top-1 in a 19K-line RISC-V processor, a 242.9% gain over prior SOTA of 7 bugs.
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements cs.SE · 2026-05-17 · unverdicted · none · ref 46
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis cs.CR · 2026-05-15 · unverdicted · none · ref 29
Neo combines LLM-based agents with code search primitives to detect privilege escalation in polyglot microservices, reporting 81% precision and 85% recall while uncovering 24 zero-day vulnerabilities across 25 applications.
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents cs.SE · 2026-05-09 · unverdicted · none · ref 41
PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals cs.SE · 2026-05-08 · unverdicted · none · ref 42 · 2 links
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.
SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair cs.SE · 2026-05-07 · unverdicted · none · ref 26 · 2 links
SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering cs.AI · 2026-05-07 · unverdicted · none · ref 1
TACT identifies drift axes in residual stream activations separating overthinking, overacting, and calibrated steps, then steers test-time activations toward the calibrated region to raise resolve rates by 4.8-5.8 pp and cut steps by up to 26% on coding benchmarks.
Deep Graph-Language Fusion for Structure-Aware Code Generation cs.SE · 2026-05-05 · unverdicted · none · ref 25
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair cs.SE · 2026-05-04 · unverdicted · none · ref 37
ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories cs.SE · 2026-04-08 · unverdicted · none · ref 89
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits cs.SE · 2026-04-03 · conditional · none · ref 44
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling cs.SE · 2026-01-21 · unverdicted · none · ref 97
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
Investigating Test Overfitting on SWE-bench cs.SE · 2025-11-20 · unverdicted · none · ref 20
The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation cs.LG · 2025-11-11 · unverdicted · none · ref 23
MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.
FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing cs.SE · 2026-05-14 · conditional · none · ref 63
FuzzAgent deploys specialized agents that collaborate on harness generation, execution, and crash triage to evolve fuzzing campaigns, delivering 45-191% more branch coverage than four baselines on 20 C/C++ libraries and surfacing 102 real bugs.
EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair cs.SE · 2026-05-09 · unverdicted · none · ref 43
EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.
Reproduction Test Generation for Java SWE Issues cs.SE · 2026-05-05 · unverdicted · none · ref 31 · 2 links
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to produce high performance on it.
SelfHeal: Empirical Fix Pattern Analysis and Bug Repair in LLM Agents cs.SE · 2026-04-20 · unverdicted · none · ref 72
SelfHeal uses two ReAct agents and empirical fix patterns to repair bugs in LLM agents, outperforming baselines on a new 37-instance benchmark.
CodeTracer: Towards Traceable Agent States cs.SE · 2026-04-13 · unverdicted · none · ref 2
CodeTracer reconstructs hierarchical state-transition trees from heterogeneous code-agent run artifacts and localizes failure onsets, outperforming baselines on the new CodeTraceBench dataset while enabling recovery of failed trajectories.
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents cs.SE · 2026-04-02 · unverdicted · none · ref 37
A two-stage SFT pipeline distills execution-free then execution-based trajectories from a 480B model into smaller Qwen2.5-Coder agents, yielding 62.2% resolution on SWE-bench Verified and 44.1% zero-shot on the multilingual version.
Can Old Tests Do New Tricks for Resolving SWE Issues? cs.SE · 2025-10-21 · conditional · none · ref 42
TestPrune minimizes regression test suites to improve bug reproduction and patch validation in LLM-based agentic repair pipelines, delivering 6-13% relative gains on SWE-Bench benchmarks at low API cost.
Multi-LLM Orchestration for High-Quality Code Generation: Exploiting Complementary Model Strengths cs.SE · 2025-10-01 · conditional · none · ref 67
PerfOrch is a four-agent multi-LLM system that uses offline profiling to build language-and-category rankings for routing tasks, achieving 97.19% and 95.83% pass@1 on HumanEval-X and EffiBench-X with generalization across benchmarks.
"Refactoring Runaway": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution cs.SE · 2026-05-21 · unverdicted · none · ref 52
Empirical study finds coding agents produce fewer and less intense tangled refactorings than humans on Multi-SWE-bench; a refactoring-aware refinement improves compilability from 19.34% to 38.33% and resolves 2.79% more issues.
From Junior to Senior: Allocating Agency and Navigating Professional Growth in Agentic AI-Mediated Software Engineering cs.HC · 2026-01-31 · unverdicted · none · ref 92
Organizational policies constrain agency in AI-mediated software engineering more than individual preferences, with seniors using detailed delegation and pre-AI instincts while juniors oscillate between over-reliance and avoidance.
LLM-Oriented Information Retrieval: A Denoising-First Perspective cs.IR · 2026-05-01 · unverdicted · none · ref 200 · 2 links
Argues for a denoising-first paradigm in LLM-oriented information retrieval, framing challenges via a four-stage progression and providing a taxonomy of signal-to-noise optimization techniques across the pipeline.
