arXiv preprint arXiv:2404.05427 (2024)

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, Abhik Roychoudhury · 2024 · arXiv 2404.05427

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.

An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

ABTest: Behavior-Driven Testing for AI Coding Agents

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.

Dynamic analysis enhances issue resolution

cs.SE · 2026-03-23 · conditional · novelty 7.0

DAIRA integrates dynamic tracing into LLM agents, achieving a 79.4% resolution rate on SWE-bench Verified for code defect repair.

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

cs.SE · 2025-11-02 · unverdicted · novelty 7.0

Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL trains LLMs with RL on software evolution data, reaching 41% on SWE-bench Verified and generalizing to other reasoning tasks.

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

cs.CL · 2024-10-09 · unverdicted · novelty 7.0

MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

cs.AI · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.

SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.

Process-Centric Analysis of Agentic Software Systems

cs.SE · 2025-12-02 · unverdicted · novelty 6.0

Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.

Agentless: Demystifying LLM-based Software Engineering Agents

cs.SE · 2024-07-01 · conditional · novelty 6.0

Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with a 32% success rate at a cost of $0.70.

Coding Agent Is Good As World Simulator

cs.AI · 2026-05-14

citing papers explorer

Showing 14 of 14 citing papers.

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 11
Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements cs.SE · 2026-05-17 · unverdicted · none · ref 57
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents cs.SE · 2026-05-13 · unverdicted · none · ref 7
The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor cs.SE · 2026-04-07 · unverdicted · none · ref 93
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
ABTest: Behavior-Driven Testing for AI Coding Agents cs.SE · 2026-04-03 · unverdicted · none · ref 31
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
Dynamic analysis enhances issue resolution cs.SE · 2026-03-23 · conditional · none · ref 27
DAIRA integrates dynamic tracing into LLM agents, achieving a 79.4% resolution rate on SWE-bench Verified for code defect repair.
Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems cs.SE · 2025-11-02 · unverdicted · none · ref 70
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 66
SWE-RL trains LLMs with RL on software evolution data, reaching 41% on SWE-bench Verified and generalizing to other reasoning tasks.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering cs.CL · 2024-10-09 · unverdicted · none · ref 31
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models cs.AI · 2026-05-09 · unverdicted · none · ref 111 · 2 links
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents cs.AI · 2026-05-08 · unverdicted · none · ref 33
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
Process-Centric Analysis of Agentic Software Systems cs.SE · 2025-12-02 · unverdicted · none · ref 57
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
Agentless: Demystifying LLM-based Software Engineering Agents cs.SE · 2024-07-01 · conditional · none · ref 114
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with a 32% success rate at a cost of $0.70.
Coding Agent Is Good As World Simulator cs.AI · 2026-05-14 · unreviewed · ref 34
