super hub Canonical reference

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Boxuan Li, Frank F. Xu, Mingchen Zhuge, Xiangru Tang, Xingyao Wang, Yufan Song · 2024 · cs.SE · arXiv 2407.16741

Canonical reference. 79% of citing Pith papers cite this work as background.

113 Pith papers citing it

Background 79% of classified citations

open full Pith review browse 113 citing papers more from Boxuan Li arXiv PDF

abstract

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 33 baseline 2 method 2 other 1

citation-polarity summary

background 30 unclear 4 baseline 2 use method 2

claims ledger

abstract Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with

authors

Boxuan Li Frank F. Xu Mingchen Zhuge Xiangru Tang Xingyao Wang Yufan Song

co-cited works

representative citing papers

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

cs.LG · 2026-05-11 · conditional · novelty 8.0

Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

cs.AI · 2026-05-10 · unverdicted · novelty 8.0

PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.

SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

cs.AI · 2026-05-10 · accept · novelty 8.0 · 2 refs

SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

cs.CR · 2026-04-03 · unverdicted · novelty 8.0

DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Training Software Engineering Agents and Verifiers with SWE-Gym

cs.SE · 2024-12-30 · conditional · novelty 8.0

SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.

Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench

cs.AR · 2026-05-13 · unverdicted · novelty 7.0

Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.

Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation

cs.CL · 2026-05-12 · conditional · novelty 7.0 · 2 refs

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.

Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection

cs.CR · 2026-05-12 · unverdicted · novelty 7.0

Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

cs.CL · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

Weblica: Scalable and Reproducible Training Environments for Visual Web Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.

Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes

cs.OS · 2026-04-30 · unverdicted · novelty 7.0

Crab bridges the agent-OS semantic gap with an eBPF inspector, turn-aligned coordinator, and host engine to deliver 100% recovery correctness while cutting checkpoint traffic up to 87% and adding under 2% overhead.

Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

cs.CL · 2026-04-28 · unverdicted · novelty 7.0 · 2 refs

AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.

Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

cs.AI · 2026-04-24 · unverdicted · novelty 7.0

OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

LLMVD.js uses LLM agents to confirm 84% of taint-style vulnerabilities on public benchmarks (vs. <22% for prior tools) and generates validated exploits for 36 of 260 new packages (vs. ≤2 for traditional tools).

citing papers explorer

Showing 50 of 113 citing papers.

Continual Harness: Online Adaptation for Self-Improving Foundation Agents cs.LG · 2026-05-11 · conditional · none · ref 21 · internal anchor
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation cs.AI · 2026-05-10 · unverdicted · none · ref 12 · internal anchor
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning cs.AI · 2026-05-10 · accept · none · ref 83 · 2 links · internal anchor
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks cs.AI · 2026-04-16 · unverdicted · none · ref 24 · internal anchor
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems cs.CR · 2026-04-03 · unverdicted · none · ref 45 · internal anchor
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 7 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Training Software Engineering Agents and Verifiers with SWE-Gym cs.SE · 2024-12-30 · conditional · none · ref 7 · internal anchor
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
MemGym: a Long-Horizon Memory Environment for LLM Agents cs.CL · 2026-05-20 · unverdicted · none · ref 45 · internal anchor
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation cs.SE · 2026-05-14 · unverdicted · none · ref 19 · internal anchor
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing cs.SE · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.
Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench cs.AR · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark cs.CV · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
Presents CUActSpot benchmark and renderer-LLM data synthesis that lets a 4B model outperform larger open-source models on complex computer interactions.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation cs.CL · 2026-05-12 · conditional · none · ref 28 · 2 links · internal anchor
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection cs.CR · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents cs.AI · 2026-05-11 · unverdicted · none · ref 198 · internal anchor
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems cs.CL · 2026-05-09 · unverdicted · none · ref 53 · 2 links · internal anchor
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
TeamBench: Evaluating Agent Coordination under Enforced Role Separation cs.AI · 2026-05-08 · unverdicted · none · ref 25 · internal anchor
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents cs.AI · 2026-05-07 · unverdicted · none · ref 35 · internal anchor
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes cs.OS · 2026-04-30 · unverdicted · none · ref 48 · internal anchor
Crab bridges the agent-OS semantic gap with an eBPF inspector, turn-aligned coordinator, and host engine to deliver 100% recovery correctness while cutting checkpoint traffic up to 87% and adding under 2% overhead.
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves cs.SE · 2026-04-29 · unverdicted · none · ref 34 · internal anchor
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses cs.CL · 2026-04-28 · unverdicted · none · ref 43 · 2 links · internal anchor
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis cs.SE · 2026-04-27 · unverdicted · none · ref 50 · internal anchor
ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company cs.AI · 2026-04-24 · unverdicted · none · ref 55 · internal anchor
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning cs.CR · 2026-04-22 · unverdicted · none · ref 51 · internal anchor
LLMVD.js uses LLM agents to confirm 84% of taint-style vulnerabilities on public benchmarks (vs. <22% for prior tools) and generates validated exploits for 36 of 260 new packages (vs. ≤2 for traditional tools).
Neurosymbolic Repo-level Code Localization cs.SE · 2026-04-17 · unverdicted · none · ref 30 · internal anchor
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems cs.AI · 2026-04-13 · unverdicted · none · ref 55 · internal anchor
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
Evaluating LLM Agents on Automated Software Analysis Tasks cs.SE · 2026-04-13 · unverdicted · none · ref 63 · internal anchor
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale cs.AI · 2026-04-11 · conditional · none · ref 43 · internal anchor
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation cs.SE · 2026-04-08 · unverdicted · none · ref 31 · internal anchor
R2ABench benchmark shows LLMs generate syntactically valid software architectures from requirements but produce structurally fragmented results due to weak relational reasoning.
Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents cs.SE · 2026-04-04 · unverdicted · none · ref 9 · internal anchor
A LoRA-fine-tuned Qwen 3.5 2B model for task-conditioned tool-output pruning reaches 0.86 recall and 0.80 F1 on a new 618-example test set while removing 92% of input tokens and outperforming larger zero-shot models.
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures cs.SE · 2026-04-03 · accept · none · ref 16 · internal anchor
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
ABTest: Behavior-Driven Testing for AI Coding Agents cs.SE · 2026-04-03 · unverdicted · none · ref 21 · internal anchor
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents cs.AI · 2026-04-03 · unverdicted · none · ref 26 · internal anchor
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation cs.SE · 2026-04-03 · unverdicted · none · ref 41 · internal anchor
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits cs.SE · 2026-04-03 · conditional · none · ref 41 · internal anchor
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
Automating Database-Native Function Code Synthesis with LLMs cs.DB · 2026-04-02 · conditional · none · ref 46 · internal anchor
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreSQL, and DuckDB while generating functions absent from SQLite 3.50.
Dynamic analysis enhances issue resolution cs.SE · 2026-03-23 · conditional · none · ref 20 · internal anchor
DAIRA integrates dynamic tracing into LLM agents to achieve 79.4% resolution rate on SWE-bench Verified for code defect repair.
FormulaCode: Evaluating Agentic Optimization on Large Codebases cs.SE · 2026-03-16 · unverdicted · none · ref 3 · internal anchor
FormulaCode is a new benchmark for repository-level LLM agent optimization using 957 mined bottlenecks, expert patches, and multi-objective metrics from real scientific Python repositories.
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents cs.LG · 2026-03-13 · unverdicted · none · ref 28 · internal anchor
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development cs.SE · 2026-03-04 · unverdicted · none · ref 22 · internal anchor
Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
VeRO: An Evaluation Harness for Agents to Optimize Agents cs.AI · 2026-02-25 · unverdicted · none · ref 24 · internal anchor
VeRO supplies a versioned harness, benchmark suite, and empirical comparison of optimizer configurations for coding agents that improve other agents.
Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs? cs.SE · 2026-02-20 · conditional · none · ref 3 · internal anchor
Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI cs.HC · 2026-01-17 · unverdicted · none · ref 75 · internal anchor
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios cs.SE · 2025-12-20 · unverdicted · none · ref 51 · internal anchor
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings cs.SE · 2025-09-15 · conditional · none · ref 51 · internal anchor
CodeCureAgent achieves 96.8% plausible fixes and 86.3% correct fixes for 1,000 SonarQube warnings across 106 Java projects using an agentic LLM framework.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution cs.SE · 2025-02-25 · unverdicted · none · ref 50 · internal anchor
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks cs.CL · 2024-12-18 · conditional · none · ref 2 · internal anchor
TheAgentCompany benchmark finds that the strongest LLM agents autonomously complete 30% of tasks in a simulated real-world software company environment.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering cs.CL · 2024-10-09 · unverdicted · none · ref 29 · internal anchor
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection cs.AI · 2026-05-22 · unverdicted · none · ref 7 · internal anchor
MemAudit combines counterfactual causal influence scores with memory consistency graphs to identify poisoned records in LLM agent memory, reducing MINJA attack success from 70% to 0% in QA and 83.3% to 0% in reasoning tasks.
Towards Direct Evaluation of Harness Optimizers via Priority Ranking cs.AI · 2026-05-21 · unverdicted · none · ref 51 · internal anchor
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer