OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Canonical reference. 79% of citing Pith papers cite this work as background.
abstract
Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.
citing papers explorer
-
Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Continual Harness automates online self-improvement for foundation-model embodied agents by refining prompts, sub-agents, skills, and memory within one run, cutting button-press costs on Pokemon Red and Emerald and closing much of the gap to expert harnesses.
-
PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation
PDEAgent-Bench is the first multi-metric, multi-library benchmark for AI-generated PDE solvers, evaluating executability, numerical accuracy, and efficiency across DOLFINx, Firedrake, and deal.II.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks
HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.
-
Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems
DDIPE poisons LLM agent skills by embedding malicious logic in documentation examples, achieving 11.6-33.5% bypass rates across frameworks while explicit attacks are blocked, with 2.5% evading detection.
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Training Software Engineering Agents and Verifiers with SWE-Gym
SWE-Gym supplies 2438 executable real-world Python tasks to train SWE agents and verifiers, yielding up to 19% gains and new open-weight SOTA of 32% on SWE-Bench Verified.
-
MemGym: a Long-Horizon Memory Environment for LLM Agents
MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.
-
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
-
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.
-
Is Agentic AI Ready for Real-World Hardware Engineering? A Deep Dive with Phoenix-bench
Phoenix-bench shows agentic AI systems lose 37-58% resolved rate when moving from SWE-bench Verified to hardware tasks because bugs spread across parallel modules via signal flow, with testbench feedback lifting performance by 42-45% while file-level oracles add only 1.4%.
-
Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
The paper presents the CUActSpot benchmark and a renderer-LLM data synthesis approach that lets a 4B model outperform larger open-source models on complex computer interactions.
-
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
Weblica: Scalable and Reproducible Training Environments for Visual Web Agents
Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 8B model that beats similar open baselines on navigation benchmarks.
-
Crab: A Semantics-Aware Checkpoint/Restore Runtime for Agent Sandboxes
Crab bridges the agent-OS semantic gap with an eBPF inspector, turn-aligned coordinator, and host engine to deliver 100% recovery correctness while cutting checkpoint traffic up to 87% and adding under 2% overhead.
-
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
-
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
AHE automates coding-agent harness evolution via component, experience, and decision observability, raising Terminal-Bench 2 pass@1 from 69.7% to 77.0% with cross-benchmark and cross-model transfer.
-
Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
-
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
The OMC framework turns multi-agent AI into self-organizing companies with Talents, a Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
-
Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning
LLMVD.js uses LLM agents to confirm 84% of taint-style vulnerabilities on public benchmarks (vs. <22% for prior tools) and generates validated exploits for 36 of 260 new packages (vs. ≤2 for traditional tools).
-
Neurosymbolic Repo-level Code Localization
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
-
Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems
A harness for AI agents enabled construction of a Rust library with 100+ problem types and 200+ reduction rules for NP-hard problems in three months.
-
Evaluating LLM Agents on Automated Software Analysis Tasks
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
-
TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale
TimeSeriesExamAgent combines templates and LLM agents to generate scalable time series reasoning benchmarks, demonstrating that current LLMs have limited performance on both abstract and domain-specific tasks.
-
Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation
The R2ABench benchmark shows that LLMs generate syntactically valid software architectures from requirements but produce structurally fragmented results due to weak relational reasoning.
-
Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents
A LoRA-fine-tuned Qwen 3.5 2B model for task-conditioned tool-output pruning reaches 0.86 recall and 0.80 F1 on a new 618-example test set while removing 92% of input tokens and outperforming larger zero-shot models.
-
Inside the Scaffold: A Source-Code Taxonomy of Coding Agent Architectures
Analysis of 13 coding agent scaffolds at pinned commits yields a 12-dimension taxonomy showing five composable loop primitives, with 11 agents combining multiple primitives instead of using one fixed structure.
-
ABTest: Behavior-Driven Testing for AI Coding Agents
ABTest mines 400 failure reports into 47 patterns and 128 actions to generate 647 tests that flag 642 new anomalies across three AI coding agents at 40.8% precision.
-
AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
AgentHazard benchmark shows computer-use agents remain highly vulnerable, with attack success rates reaching 73.63% on models like Qwen3-Coder powering Claude Code.
-
Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.
-
AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.
-
Automating Database-Native Function Code Synthesis with LLMs
DBCooker automates synthesis of database native functions via LLM-guided characterization, coding plans, hybrid filling, and progressive validation, delivering 34.55% higher accuracy than baselines on SQLite, PostgreSQL, and DuckDB while generating functions absent from SQLite 3.50.
-
Dynamic analysis enhances issue resolution
DAIRA integrates dynamic tracing into LLM agents to achieve a 79.4% resolution rate on SWE-bench Verified for code defect repair.
-
FormulaCode: Evaluating Agentic Optimization on Large Codebases
FormulaCode is a new benchmark for repository-level LLM agent optimization using 957 mined bottlenecks, expert patches, and multi-objective metrics from real scientific Python repositories.
-
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
-
Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Vibe Code Bench evaluates AI models on building complete web applications from specs, with the best of 16 models achieving 61.8% accuracy on the test split using autonomous browser evaluation.
-
VeRO: An Evaluation Harness for Agents to Optimize Agents
VeRO supplies a versioned harness, benchmark suite, and empirical comparison of optimizer configurations for coding agents that improve other agents.
-
Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?
Debug2Fix integrates interactive debugging via subagents into coding agents, delivering >20% gains on GitBug-Java and SWE-Bench-Live while enabling weaker models to match stronger ones.
-
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
CodeCureAgent: Automatic Classification and Repair of Static Analysis Warnings
CodeCureAgent achieves 96.8% plausible fixes and 86.3% correct fixes for 1,000 SonarQube warnings across 106 Java projects using an agentic LLM framework.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL applies RL to software evolution data to train LLMs that reach 41% on SWE-bench Verified and generalize to other reasoning tasks.
-
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
TheAgentCompany benchmark finds that the strongest LLM agents autonomously complete 30% of tasks in a simulated real-world software company environment.
-
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
-
MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection
MemAudit combines counterfactual causal influence scores with memory consistency graphs to identify poisoned records in LLM agent memory, reducing MINJA attack success from 70% to 0% in QA and 83.3% to 0% in reasoning tasks.
-
Towards Direct Evaluation of Harness Optimizers via Priority Ranking
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.