A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
hub
arXiv preprint arXiv:2406.01304 (2024)
17 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
OrchestrXR uses multi-agent orchestration with structured schemas to generate Unity XR study prototypes from ideas, supported by a user study with 12 researchers indicating effective support and intent preservation.
Extends density evolution to role-typed factor graphs with nonlinear Boolean verifiers to predict asymptotic unresolved subclaims in AI agent networks under three erasure failure modes.
Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
NN-RAG extracts 1,289 candidate neural modules from 19 PyTorch repositories, validates 941 of them, and supplies roughly 72% of the novel structures in the LEMUR dataset while enabling cross-repository migration.
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
Non-linear domain-scoped parallel LLM agents achieve higher micro F1 than linear exploration and some baselines for multi-file change localization on SWE-bench Pro ansible tasks.
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
citing papers explorer
-
Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
-
OrchestrXR: A Multi-Agent System for Idea-to-Prototype XR Study Authoring
OrchestrXR uses multi-agent orchestration with structured schemas to generate Unity XR study prototypes from ideas, supported by a user study with 12 researchers indicating effective support and intent preservation.
-
On the Reliability of Networks of AI Agents: Density Evolution, Stopping Sets, and Architecture Optimization
Extends density evolution to role-typed factor graphs with nonlinear Boolean verifiers to predict asymptotic unresolved subclaims in AI agent networks under three erasure failure modes.
-
Agentic Coding Needs Proactivity, Not Just Autonomy
Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.
-
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
-
REAgent: Requirement-Driven LLM Agents for Software Issue Resolution
REAgent improves LLM patch generation for software issues by 17.4% on average through automated construction, quality checking, and iterative refinement of structured issue-oriented requirements.
-
Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints
Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
-
A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks
NN-RAG extracts 1,289 candidate neural modules from 19 PyTorch repositories, validates 941 of them, and supplies roughly 72% of the novel structures in the LEMUR dataset while enabling cross-repository migration.
-
Process-Centric Analysis of Agentic Software Systems
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
-
Agentless: Demystifying LLM-based Software Engineering Agents
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
-
Exploration Structure in LLM Agents for Multi-File Change Localization
Non-linear domain-scoped parallel LLM agents achieve higher micro F1 than linear exploration and some baselines for multi-file change localization on SWE-bench Pro ansible tasks.
-
What makes a harness a harness: necessary and sufficient conditions for an agent harness
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
-
LLM-Based Automated Diagnosis Of Integration Test Failures At Google
Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
-
From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap
A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.
-
Large Language Model-Based Agents for Software Engineering: A Survey
A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
- iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation