Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
hub
arXiv preprint arXiv:2404.05427 (2024)
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
citing papers explorer
-
Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
-
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
-
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents
The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.
-
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
-
ABTest: Behavior-Driven Testing for AI Coding Agents
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
-
Dynamic analysis enhances issue resolution
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
-
Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
-
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
-
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
-
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
-
Process-Centric Analysis of Agentic Software Systems
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
-
Agentless: Demystifying LLM-based Software Engineering Agents
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
- Coding Agent Is Good As World Simulator