Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang · 2024 · cs.SE · arXiv 2407.01489

Canonical reference. 75% of citing Pith papers cite this work as background.

58 Pith papers citing it

abstract

Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic three-phase process of localization, repair, and patch validation, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (32.00%, 96 correct fixes) and low cost ($0.70) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.
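The fixed three-phase process the abstract describes (localization, repair, patch validation) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how each phase works internally, so every phase here is a hypothetical callable, and `Patch`, `run_fixed_pipeline`, and the toy stand-ins are names invented for illustration. The only property the sketch aims to show is that the control flow is hard-coded, so no model decides which step comes next.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Patch:
    """Hypothetical candidate fix: target file plus its proposed new contents."""
    path: str
    new_source: str


def run_fixed_pipeline(
    issue: str,
    repo: Dict[str, str],
    localize: Callable[[str, Dict[str, str]], List[str]],
    repair: Callable[[str, str, str], List[Patch]],
    validate: Callable[[Patch], bool],
) -> Optional[Patch]:
    """Run the three phases in a fixed order and return the first valid patch."""
    # Phase 1: localization. Narrow the repository down to suspicious files.
    suspicious = localize(issue, repo)

    # Phase 2: repair. Collect candidate patches for each suspicious file.
    candidates: List[Patch] = []
    for path in suspicious:
        candidates.extend(repair(issue, path, repo[path]))

    # Phase 3: patch validation. Keep the first candidate that passes the check.
    for patch in candidates:
        if validate(patch):
            return patch
    return None


# Toy stand-ins (assumptions for illustration only). In a real system the
# localize and repair steps would be LLM calls and validate would run tests.
repo = {
    "calc.py": "def add(a, b):\n    return a - b\n",
    "util.py": "def noop():\n    pass\n",
}


def toy_localize(issue: str, repo: Dict[str, str]) -> List[str]:
    # Keep files whose name (without extension) is mentioned in the issue.
    return [path for path in repo if path.split(".")[0] in issue]


def toy_repair(issue: str, path: str, source: str) -> List[Patch]:
    # Propose two candidates; only the second one is actually correct.
    return [
        Patch(path, source.replace("a - b", "a * b")),
        Patch(path, source.replace("a - b", "a + b")),
    ]


def toy_validate(patch: Patch) -> bool:
    # Execute the patched source and check a single regression case.
    namespace: dict = {}
    exec(patch.new_source, namespace)
    return namespace["add"](2, 3) == 5


fixed = run_fixed_pipeline(
    "add in calc returns the wrong sum", repo, toy_localize, toy_repair, toy_validate
)
```

With these stand-ins, the first candidate fails validation and the second is returned, which shows why a validation phase is useful when repair produces several candidates.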

citation-role summary

background: 9 · baseline: 2 · method: 1

citation-polarity summary

background: 9 · baseline: 2 · use method: 1

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

cs.AR · 2026-05-13 · unverdicted · novelty 7.0

Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.

AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

cs.SE · 2026-05-13 · conditional · novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

CrackMeBench: Binary Reverse Engineering for Agents

cs.SE · 2026-05-11 · accept · novelty 7.0

CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.

Constraint Decay: The Fragility of LLM Agents in Backend Code Generation

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.

ProgramBench: Can Language Models Rebuild Programs From Scratch?

cs.SE · 2026-05-05 · unverdicted · novelty 7.0

ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.

Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

cs.SE · 2026-05-02 · unverdicted · novelty 7.0

Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations like lengthy outputs, incorrect code, and hallucinations.

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49%

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.

Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.

Neurosymbolic Repo-level Code Localization

cs.SE · 2026-04-17 · unverdicted · novelty 7.0

LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

Evaluating LLMs Code Reasoning Under Real-World Context

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.

An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

Dynamic analysis enhances issue resolution

cs.SE · 2026-03-23 · conditional · novelty 7.0

DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

cs.SE · 2026-03-04 · unverdicted · novelty 7.0

Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.

AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification

cs.SE · 2026-02-03 · conditional · novelty 7.0

AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.

Agentic Much? Adoption of Coding Agents on GitHub

cs.SE · 2026-01-26 · conditional · novelty 7.0

Coding agents reached 22-29% adoption in GitHub projects within months of release, with agent-assisted commits larger and focused on features and bug fixes.

Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

cs.HC · 2026-01-17 · unverdicted · novelty 7.0

Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

cs.SE · 2025-12-20 · unverdicted · novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

cs.SE · 2025-05-27 · unverdicted · novelty 7.0

Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.

citing papers explorer

Showing 50 of 58 citing papers.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? cs.AI · 2026-05-07 · unverdicted · none · ref 75 · internal anchor
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 19 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench cs.AR · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents cs.SE · 2026-05-13 · unverdicted · none · ref 6 · internal anchor
The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.
PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization cs.SE · 2026-05-13 · unverdicted · none · ref 43 · internal anchor
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation cs.SE · 2026-05-13 · conditional · none · ref 39 · internal anchor
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
CrackMeBench: Binary Reverse Engineering for Agents cs.SE · 2026-05-11 · accept · none · ref 9 · internal anchor
CrackMeBench introduces 20 deterministic binary validation tasks and reports GPT-5.5 solving 11/12 generated ones at pass@3 while Claude and Kimi lag, especially on harder tasks.
Constraint Decay: The Fragility of LLM Agents in Backend Code Generation cs.SE · 2026-05-07 · unverdicted · none · ref 27 · internal anchor
LLM agents exhibit constraint decay with assertion pass rates dropping substantially as structural requirements increase in multi-file backend code generation across web frameworks.
ProgramBench: Can Language Models Rebuild Programs From Scratch? cs.SE · 2026-05-05 · unverdicted · none · ref 3 · internal anchor
ProgramBench introduces 200 tasks where models must reconstruct full programs like FFmpeg or SQLite from docs alone; none of 9 evaluated LMs fully solve any task and the best passes 95% tests on only 3% of tasks while favoring monolithic code.
Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey cs.SE · 2026-05-02 · unverdicted · none · ref 44 · internal anchor
Developers use LLMs like ChatGPT mainly for knowledge acquisition and code generation at the detailed design level, reporting benefits such as better technology selection and early flaw detection alongside limitations like lengthy outputs, incorrect code, and hallucinations.
Social Bias in LLM-Generated Code: Benchmark and Mitigation cs.SE · 2026-05-01 · unverdicted · none · ref 165 · internal anchor
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Context-Augmented Code Generation: How Product Context Improves AI Coding Agent Decision Compliance by 49% cs.SE · 2026-04-27 · unverdicted · none · ref 13 · internal anchor
Adding product context retrieval to AI coding agents raises decision compliance from 46% to 95% on a new benchmark of 8 tasks with 41 weighted decision points.
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis cs.SE · 2026-04-27 · unverdicted · none · ref 52 · internal anchor
ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery cs.CR · 2026-04-22 · unverdicted · none · ref 40 · internal anchor
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
Neurosymbolic Repo-level Code Localization cs.SE · 2026-04-17 · unverdicted · none · ref 32 · internal anchor
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
Evaluating LLMs Code Reasoning Under Real-World Context cs.SE · 2026-04-14 · unverdicted · none · ref 25 · internal anchor
R2Eval is a new benchmark with 135 real-world code reasoning problems from Python projects that preserves complex data structures for more realistic LLM evaluation.
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor cs.SE · 2026-04-07 · unverdicted · none · ref 92 · internal anchor
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
Dynamic analysis enhances issue resolution cs.SE · 2026-03-23 · conditional · none · ref 22 · internal anchor
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development cs.SE · 2026-03-04 · unverdicted · none · ref 23 · internal anchor
Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification cs.SE · 2026-02-03 · conditional · none · ref 45 · internal anchor
AgenticSZZ reframes bug-inducing commit identification as temporal knowledge graph search navigated by an LLM agent, reporting F1 scores of 0.47-0.79 and up to 34% improvement over prior SZZ methods on three datasets.
Agentic Much? Adoption of Coding Agents on GitHub cs.SE · 2026-01-26 · conditional · none · ref 41 · internal anchor
Coding agents reached 22-29% adoption in GitHub projects within months of release, with agent-assisted commits larger and focused on features and bug fixes.
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI cs.HC · 2026-01-17 · unverdicted · none · ref 85 · internal anchor
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios cs.SE · 2025-12-20 · unverdicted · none · ref 58 · internal anchor
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
Code Researcher: Deep Research Agent for Large Systems Code and Commit History cs.SE · 2025-05-27 · unverdicted · none · ref 39 · internal anchor
Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving cs.SE · 2025-04-03 · unverdicted · none · ref 18 · internal anchor
Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 57 · internal anchor
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents cs.SE · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
P2T distills reference patches into a latent process graph and uses it to select shortest effective trajectory segments from teacher rollouts, yielding up to 10.8 point Pass@1 gains on SWE-bench Verified with 15% lower inference cost using only 1.8k instances.
Revisiting DAgger in the Era of LLM-Agents cs.LG · 2026-05-13 · conditional · none · ref 10 · internal anchor
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models cs.AI · 2026-05-09 · unverdicted · none · ref 106 · 2 links · internal anchor
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents cs.AI · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution cs.LG · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
SWE Atlas is a benchmark suite for coding agents that evaluates Codebase Q&A, Test Writing, and Refactoring using comprehensive protocols assessing both functional correctness and software engineering quality.
Neuro-Symbolic Agents for Hallucination-Free Requirements Reuse cs.SE · 2026-05-02 · unverdicted · none · ref 31 · internal anchor
A neuro-symbolic agent system for requirements reuse achieves 100% coverage and 0.2% constraint violations by construction through symbolic enforcement of an OOMRAM lattice.
TypeScript Repository Indexing for Code Agent Retrieval cs.SE · 2026-04-20 · unverdicted · none · ref 16 · internal anchor
abcoder-ts-parser builds reliable function-level code indexes for large TypeScript repositories significantly faster by using the compiler's native AST and semantic resolution instead of per-symbol language server calls.
AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection cs.SE · 2026-04-13 · conditional · none · ref 61 · internal anchor
AnyPoC introduces a multi-agent system for generating and validating PoC tests from LLM bug reports, producing 1.3x more valid PoCs, rejecting 9.8x more false positives, and discovering 122 new bugs across 12 major projects.
GALA: Multimodal Graph Alignment for Bug Localization in Automated Program Repair cs.SE · 2026-04-09 · unverdicted · none · ref 39 · internal anchor
GALA uses hierarchical graph alignment between UI screenshots and code structures to achieve state-of-the-art bug localization in multimodal automated program repair on SWE-bench.
On the Role of Fault Localization Context for LLM-Based Program Repair cs.SE · 2026-04-07 · unverdicted · none · ref 37 · internal anchor
More fault localization context does not consistently improve LLM-based program repair; file-level context gives 15-17x gains, optimal around 6-10 files, while line-level context often degrades performance from noise.
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints cs.SE · 2026-04-06 · unverdicted · none · ref 49 · internal anchor
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
Toward Training Superintelligent Software Agents through Self-Play SWE-RL cs.SE · 2025-12-21 · unverdicted · none · ref 50 · internal anchor
Self-play RL on bug injection and repair in sandboxed repositories yields +10.4 and +7.8 point gains on SWE-bench Verified and Pro while outperforming human-data baselines.
Process-Centric Analysis of Agentic Software Systems cs.SE · 2025-12-02 · unverdicted · none · ref 54 · internal anchor
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
Knowledge-Graph-Driven Data Synthesis for Low-Resource Software Development: A HarmonyOS Case Study cs.SE · 2025-11-29 · unverdicted · none · ref 68 · internal anchor
APIKG4Syn synthesizes API-oriented training data via knowledge graphs and Monte Carlo search to fine-tune a 7B model that reaches 25% pass@1 on HarmonyOS code generation, beating untuned GPT-4o at 17.59%.
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory cs.CL · 2025-11-25 · unverdicted · none · ref 223 · internal anchor
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
The Command Line GUIde: Graphical Interfaces from Man Pages via AI cs.HC · 2025-10-01 · unverdicted · none · ref 15 · 2 links · internal anchor
GUIde uses AI to translate man pages into graphical interface specifications for command line tools, evaluated on a corpus of real commands.
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? cs.SE · 2025-09-21 · conditional · none · ref 14 · internal anchor
SWE-Bench Pro is a new benchmark with 1,865 long-horizon tasks from 41 repositories designed to evaluate AI agents on realistic enterprise-level software engineering problems beyond prior benchmarks.
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention cs.CL · 2025-06-16 · unverdicted · none · ref 45 · internal anchor
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair cs.SE · 2025-06-12 · conditional · none · ref 35 · internal anchor
ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.
Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios cs.SE · 2025-03-16 · accept · none · ref 70 · internal anchor
Empirical study of 3977 agent trajectories finds Python execution errors correlate with lower success rates on GitHub issues, flags challenging errors, and reports three confirmed bugs in the SWE-Bench platform.
GEAR: Genetic AutoResearch for Agentic Code Evolution cs.NE · 2026-05-08 · unverdicted · none · ref 23 · internal anchor
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant cs.SE · 2026-04-26 · unverdicted · none · ref 24 · internal anchor
KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent history, and git worktree isolation while self-validating outputs.
More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems cs.SE · 2026-04-20 · unverdicted · none · ref 43 · internal anchor
AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
Sema Code: Decoupling AI Coding Agents into Programmable, Embeddable Infrastructure cs.SE · 2026-04-13 · unverdicted · none · ref 9 · internal anchor
Sema Code decouples AI coding agents into a programmable npm library with eight mechanisms for isolation, queuing, compression, scheduling, permissions, and integration.
