hub

AutoCodeRover : Autonomous Program Improvement , July 2024

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, Abhik Roychoudhury · 2024 · arXiv 2404.05427

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

read on arXiv browse 22 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

LLM Agents Can See Code Repositories

cs.SE · 2026-06-12 · unverdicted · novelty 7.0

Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

cs.LG · 2026-06-10 · conditional · novelty 7.0

Claw-SWE-Bench is a 350-instance multilingual benchmark for OpenClaw-style agent harnesses that shows adapter design raises Pass@1 from 19.1% to 73.4% on the same model while releasing data for reproducible comparison.

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.

An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.

ABTest: Behavior-Driven Testing for AI Coding Agents

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

cs.SE · 2025-11-02 · unverdicted · novelty 7.0

Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

cs.SE · 2025-02-25 · unverdicted · novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

cs.CL · 2024-10-09 · unverdicted · novelty 7.0

MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.

ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation

cs.SE · 2026-07-01 · unverdicted · novelty 6.0

ClarifyCodeBench is a new benchmark with manual annotations and two metrics showing that LLMs strong at code generation are weak at clarifying ambiguous requirements, with performance worsening as ambiguity density rises.

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

cs.AI · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.

SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.

Process-Centric Analysis of Agentic Software Systems

cs.SE · 2025-12-02 · unverdicted · novelty 6.0

Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.

Agentless: Demystifying LLM-based Software Engineering Agents

cs.SE · 2024-07-01 · conditional · novelty 6.0

Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.

Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation

cs.CV · 2026-06-21 · unverdicted · novelty 5.0

Sol Video Inference Engine uses parallel skill agents to optimize cache, sparse attention, token pruning, quantization, and kernel fusion, delivering over 2x end-to-end acceleration with near-lossless quality on three video models.

Dissecting model behavior through agent trajectories

cs.AI · 2026-06-16 · unverdicted · novelty 5.0

SSA harness matches frontier model pass@1 scores on agent benchmarks and 138k trajectory analysis in code state-spaces shows model-specific differences in edit frequency, testing activity, and phase transitions.

ContextSniper: AntTrail's Token-Efficient Code Memory for Repository-Level Program Repair

cs.AI · 2026-07-02 · unverdicted · novelty 4.0

ContextSniper reduces token use by 38.9-51.5% in repository-level program repair agents on SWE-bench Lite with 2 percentage point drops in resolution rate.

Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent

cs.AI · 2026-06-21 · unverdicted · novelty 4.0

Ablation study finds that a structural codebase index improves localization and resolve rates in coding agents on two SWE benchmarks without raising per-cell cost.

What makes a harness a harness: necessary and sufficient conditions for an agent harness

cs.SE · 2026-06-08 · unverdicted · novelty 4.0

Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.

Coding Agent Is Good As World Simulator

cs.AI · 2026-05-14 · unverdicted · novelty 4.0 · 2 refs

An agentic framework generates executable physics simulation code from text prompts via coordinated planning, coding, visual, and physics agents that iterate to satisfy both prompt fidelity and physical constraints.

Dynamic analysis enhances issue resolution

cs.SE · 2026-03-23

citing papers explorer

Showing 19 of 19 citing papers after filters.

LLM Agents Can See Code Repositories cs.SE · 2026-06-12 · unverdicted · none · ref 45
Visual graphs of repository structure added to text inputs for multimodal LLM agents reduce token consumption by up to 26% while maintaining or improving issue-resolution accuracy.
Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 11
Proposes Formal Skill as a programmable runtime abstraction for LLM agents, implemented in open-source FairyClaw, achieving competitive Harness-Bench scores with substantially fewer tokens.
From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements cs.SE · 2026-05-17 · unverdicted · none · ref 57
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.
AI Harness Engineering: A Runtime Substrate for Foundation-Model Software Agents cs.SE · 2026-05-13 · unverdicted · none · ref 7
The paper defines AI Harness Engineering as a runtime substrate with eleven components and a four-level ladder that reframes agent reliability as a model-harness-environment system property rather than model capability alone.
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor cs.SE · 2026-04-07 · unverdicted · none · ref 93
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
ABTest: Behavior-Driven Testing for AI Coding Agents cs.SE · 2026-04-03 · unverdicted · none · ref 31
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems cs.SE · 2025-11-02 · unverdicted · none · ref 70
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 66
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering cs.CL · 2024-10-09 · unverdicted · none · ref 31
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
ClarifyCodeBench: Evaluating LLMs on Clarifying Ambiguous Requirements for Code Generation cs.SE · 2026-07-01 · unverdicted · none · ref 40
ClarifyCodeBench is a new benchmark with manual annotations and two metrics showing that LLMs strong at code generation are weak at clarifying ambiguous requirements, with performance worsening as ambiguity density rises.
BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models cs.AI · 2026-05-09 · unverdicted · none · ref 111 · 2 links
BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.
SkillLens: Adaptive Multi-Granularity Skill Reuse for Cost-Efficient LLM Agents cs.AI · 2026-05-08 · unverdicted · none · ref 33
SkillLens organizes skills into policies-strategies-procedures-primitives layers, retrieves via degree-corrected random walk, and uses a verifier for local adaptation, yielding up to 6.31 pp gains on MuLocbench and raising ALFWorld success from 45% to 51.31%.
Process-Centric Analysis of Agentic Software Systems cs.SE · 2025-12-02 · unverdicted · none · ref 57
Graphectory turns stochastic agent trajectories into analyzable graphs, showing that stronger models and successful fixes follow coherent localization-validation steps while failures are chaotic, and online detection plus rollback improves resolution rates by 6.9-23.5%.
Sol Video Inference Engine: Agent-Native Full-Stack Acceleration Framework for Efficient Video Generation cs.CV · 2026-06-21 · unverdicted · none · ref 42
Sol Video Inference Engine uses parallel skill agents to optimize cache, sparse attention, token pruning, quantization, and kernel fusion, delivering over 2x end-to-end acceleration with near-lossless quality on three video models.
Dissecting model behavior through agent trajectories cs.AI · 2026-06-16 · unverdicted · none · ref 56
SSA harness matches frontier model pass@1 scores on agent benchmarks and 138k trajectory analysis in code state-spaces shows model-specific differences in edit frequency, testing activity, and phase transitions.
ContextSniper: AntTrail's Token-Efficient Code Memory for Repository-Level Program Repair cs.AI · 2026-07-02 · unverdicted · none · ref 3
ContextSniper reduces token use by 38.9-51.5% in repository-level program repair agents on SWE-bench Lite with 2 percentage point drops in resolution rate.
Code Isn't Memory: A Structural Codebase Index Inside a Coding Agent cs.AI · 2026-06-21 · unverdicted · none · ref 11
Ablation study finds that a structural codebase index improves localization and resolve rates in coding agents on two SWE benchmarks without raising per-cell cost.
What makes a harness a harness: necessary and sufficient conditions for an agent harness cs.SE · 2026-06-08 · unverdicted · none · ref 64
Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
Coding Agent Is Good As World Simulator cs.AI · 2026-05-14 · unverdicted · none · ref 34 · 2 links
An agentic framework generates executable physics simulation code from text prompts via coordinated planning, coding, visual, and physics agents that iterate to satisfy both prompt fidelity and physical constraints.

AutoCodeRover : Autonomous Program Improvement , July 2024

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer