LLMs frequently specify library versions with known CVEs in generated code (36-56% of tasks), show low compatibility (20-63%), and converge on the same risky versions across models.
Demystifying LLM-based software engineering agents
Canonical reference. 82% of citing Pith papers cite this work as background.
representative citing papers
The same behavioral signals in LLM-based software engineering agents correlate with task success in opposite directions across different frameworks, with framework identity explaining more variance than the underlying LLM.
BluesFL uses block-level instruction-oriented slicing with LLMs to localize 24 bugs at Top-1 in a 19K-line RISC-V processor, a 242.9% gain over prior SOTA of 7 bugs.
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
Neo combines LLM-based agents with code search primitives to detect privilege escalation in polyglot microservices, reporting 81% precision and 85% recall while uncovering 24 zero-day vulnerabilities across 25 applications.
MASPrism attributes failures in multi-agent systems by ranking candidates using prefill-stage NLL and attention signals from a 0.6B SLM. It beats baselines by up to 33.41% in Top-1 accuracy and proprietary LLMs by up to 89.5% in relative improvement, while processing traces in 2.66 seconds.
SmellBench is the first benchmark showing LLM agents resolve 47.7% of architectural code smells while accurately spotting false positives, but aggressive repairs often introduce new smells and degrade overall quality.
TACT identifies drift axes in residual stream activations separating overthinking, overacting, and calibrated steps, then steers test-time activations toward the calibrated region to raise resolve rates by 4.8-5.8 pp and cut steps by up to 26% on coding benchmarks.
CGFuse enables deep token-level fusion of graph-derived structural features into language models, yielding 10-16% BLEU and 6-11% CodeBLEU gains on code generation tasks.
ARISE adds a data-flow-augmented repository graph and three-tier tool API to LLM agents, raising Function Recall@1 by 17 points, Line Recall@1 by 15 points, and Pass@1 repair rate to 22% on SWE-bench Lite.
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
The first empirical study of test overfitting shows that auto-generated tests from issues can lead to code that passes observed tests but misses important cases or breaks functionality in SWE-bench issue resolution.
MURPHY improves code generation pass rates by up to 6% through retrospective credit assignment on multi-turn feedback trees using max or mean reward propagation.
JETO-Mine is a reusable three-phase pipeline that mines 1.8 million Java commits to produce JETO-Bench containing 91 verified executable ETIPs, on which OpenHands succeeds at 14.3%.
FeatX extracts epic-feature hierarchies with code mappings from repositories and applies feature edits via a three-stage Evolution Agent, reporting 42.6% relative F1 gain in function-level localization and lower cognitive load versus vanilla ChatGPT in a user study and 38-commit replay.
Loc2Repair framework evaluation finds that file-level localization boosts LLM repo repair resolved rates by up to 7.7 percentage points on SWE-bench Verified.
Empirical analysis of LLM repair agents shows that the benefits of execution are concentrated: restricting execution causes only a statistically non-significant 1.25 pp drop in resolve rate while cutting token and time costs.
FuzzAgent deploys specialized agents that collaborate on harness generation, execution, and crash triage to evolve fuzzing campaigns, delivering 45-191% more branch coverage than four baselines on 20 C/C++ libraries and surfacing 102 real bugs.
PROBE turns runtime telemetry from failed software engineering agent runs into evidence-grounded diagnoses and actionable recovery guidance, achieving 65.37% diagnosis accuracy and 21.79% recovery rate on 257 cases.
EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.
Introduces the first benchmark for Java reproduction test generation from repository issues and adapts a prior Python tool, which achieves high performance on it.
A pre-registered bibliometric audit of 18,574 LLM papers finds a median 10.85 ECI lag behind the contemporaneous frontier, widening at 5.53 ECI/year, with only 3.2% of abstracts disclosing reasoning mode.