Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang · 2025 · Proceedings of the ACM on Software Engineering · DOI 10.1145/3715754

Canonical reference. 82% of citing Pith papers cite this work as background.

32 Pith papers citing it

38 external citations · Crossref

Background 82% of classified citations

citation-role summary

background: 9 · baseline: 1 · method: 1

citation-polarity summary

background: 9 · baseline: 1 · use method: 1

representative citing papers

Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions

cs.SE · 2026-05-07 · conditional · novelty 8.0

LLMs frequently specify library versions with known CVEs in generated code (36-56% of tasks), show low compatibility (20-63%), and converge on the same risky versions across models.

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

cs.SE · 2026-05-18 · conditional · novelty 7.0

The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.

Debug Like a Human: Scaling LLM-based Fault Localization to Processor Design via Block-Level Instruction-Oriented Slicing

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

BluesFL uses block-level instruction-oriented slicing with LLMs to localize 24 bugs at Top-1 in a 19K-line RISC-V processor, a 242.9% gain over prior SOTA of 7 bugs.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

Detecting Privilege Escalation in Polyglot Microservices via Agentic Program Analysis

cs.CR · 2026-05-15 · unverdicted · novelty 7.0

Neo combines LLM-based agents with code search primitives to detect privilege escalation in polyglot microservices, reporting 81% precision and 85% recall while uncovering 24 zero-day vulnerabilities across 25 applications.

MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

cs.SE · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.

SmellBench: Evaluating LLM Agents on Architectural Code Smell Repair

cs.SE · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

TACT identifies drift axes in residual stream activations separating overthinking, overacting, and calibrated steps, then steers test-time activations toward the calibrated region to raise resolve rates by 4.8-5.8 pp and cut steps by up to 26% on coding benchmarks.

Deep Graph-Language Fusion for Structure-Aware Code Generation

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.

ARISE: A Repository-level Graph Representation and Toolset for Agentic Fault Localization and Program Repair

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.

AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

cs.SE · 2026-04-03 · conditional · novelty 7.0

AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.

When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

cs.SE · 2026-01-21 · unverdicted · novelty 7.0

A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.

Investigating Test Overfitting on SWE-bench

cs.SE · 2025-11-20 · unverdicted · novelty 7.0

The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.

MURPHY: Feedback-Aware GRPO with Retrospective Credit Assignment for Multi-Turn Code Generation

cs.LG · 2025-11-11 · unverdicted · novelty 7.0

MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.

JETO-Bench: A Reproducible Benchmark for Execution Time Improvement Patches in Java

cs.SE · 2026-06-30 · conditional · novelty 6.0

JETO-Mine is a reusable three-phase pipeline that mines 1.8 million Java commits to produce JETO-Bench containing 91 verified executable ETIPs, on which OpenHands succeeds at 14.3%.

FeatX: Editing Software by Editing Features for Repository-Level Code Evolution

cs.SE · 2026-06-30 · unverdicted · novelty 6.0

FeatX extracts epic-feature hierarchies with code mappings from repositories and applies feature edits via a three-stage Evolution Agent, reporting 42.6% relative F1 gain in function-level localization and lower cognitive load versus vanilla ChatGPT in a user study and 38-commit replay.

Loc2Repair: A Framework for Evaluating the Impact of File-Level Issue Localization in Repo-Level LLM Repair

cs.SE · 2026-06-29 · accept · novelty 6.0

Loc2Repair framework evaluation finds that file-level localization boosts LLM repo repair resolved rates by up to 7.7 percentage points on SWE-bench Verified.

To Run or Not to Run: Analyzing the Cost-Effectiveness of Code Execution in LLM-Based Program Repair

cs.SE · 2026-06-25 · unverdicted · novelty 6.0

Empirical analysis of LLM repair agents shows execution provides concentrated benefits, with restrictions causing only a 1.25 pp non-significant drop in resolve rate while cutting token and time costs.

FuzzAgent: Multi-Agent System for Evolutionary Library Fuzzing

cs.SE · 2026-05-14 · conditional · novelty 6.0

FuzzAgent deploys specialized agents that collaborate on harness generation, execution, and crash triage to evolve fuzzing campaigns, delivering 45-191% more branch coverage than four baselines on 20 C/C++ libraries and surfacing 102 real bugs.

Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

cs.SE · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

PROBE turns runtime telemetry from failed software engineering agent runs into evidence-grounded diagnoses and actionable recovery guidance, achieving 65.37% diagnosis accuracy and 21.79% recovery rate on 257 cases.

EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair

cs.SE · 2026-05-09 · unverdicted · novelty 6.0

EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.

Reproduction Test Generation for Java SWE Issues

cs.SE · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool to achieve high performance on it.

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

cs.CY · 2026-05-05 · conditional · novelty 6.0

A pre-registered bibliometric audit of 18,574 LLM papers finds a median 10.85 ECI lag behind the contemporaneous frontier, widening at 5.53 ECI/year, with only 3.2% of abstracts disclosing reasoning mode.
