The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents
Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3
The pith
OpenHands SDK delivers a composable agent interface that combines sandboxed execution, lifecycle control, and model-agnostic LLM routing for production software agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The OpenHands Software Agent SDK supplies a practical foundation for prototyping and scaling software agents by offering a simple default interface that extends to full-featured implementations, integrated REST and WebSocket services for reliability, and user-facing connections to visual workspaces and APIs, all while delivering measurable reductions in production failures and consistent benchmark performance.
What carries the argument
A minimal agent interface that requires only a few lines of code in the base case yet supports extensibility through custom tools, memory management, and model-agnostic multi-LLM routing, paired with native sandboxed execution and lifecycle control.
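The "few lines in the base case, extensible afterwards" idea can be illustrated with a toy sketch. Everything below (`MiniAgent`, `register_tool`, the `policy` callable) is a hypothetical stand-in written for this note and is not the OpenHands SDK API; it only shows how a small default can grow custom tools and a simple memory without changes to the core class.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout; this is not the OpenHands SDK API.
ToolFn = Callable[[str], str]


@dataclass
class MiniAgent:
    """Toy agent: a policy picks a tool by name, the agent runs it and records the step."""
    policy: Callable[[str], str]                      # maps a task to a tool name
    tools: Dict[str, ToolFn] = field(default_factory=dict)
    memory: List[str] = field(default_factory=list)   # naive append-only memory

    def register_tool(self, name: str, fn: ToolFn) -> None:
        self.tools[name] = fn

    def run(self, task: str) -> str:
        tool_name = self.policy(task)
        if tool_name not in self.tools:
            raise KeyError(f"no tool registered under {tool_name!r}")
        result = self.tools[tool_name](task)
        self.memory.append(f"{tool_name}: {result}")
        return result


# Base case: one built-in tool and a constant policy.
agent = MiniAgent(policy=lambda task: "echo", tools={"echo": lambda t: t})

# Extension: add a custom tool and swap the policy without touching MiniAgent.
agent.register_tool("shout", lambda t: t.upper())
agent.policy = lambda task: "shout" if task.endswith("!") else "echo"
```

In a real agent the policy would be an LLM call and the tools would run inside a sandbox; the sketch keeps both as plain callables so the extension points stay visible.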
If this is right
- Agents can be implemented with minimal code and extended for custom tools or memory without rewriting core components.
- Execution moves seamlessly between local and remote environments while preserving security through sandboxing.
- Agents connect directly to user interfaces including IDEs, browsers, and command lines without additional adapters.
- Production deployments experience fewer system-attributable failures after the redesign with only negligible event-sourcing cost.
- Performance remains strong when evaluated across multiple models and standard software-agent benchmarks.
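The event-sourcing cost mentioned above refers to the general pattern of recording every step as an immutable event and deriving state by replay. A minimal sketch of that pattern follows, assuming hypothetical event kinds and a JSON serialization chosen only for illustration; the paper's actual event schema is not described in this summary.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical kinds: "message", "tool_call"
    payload: str


class EventLog:
    """Append-only log; state is not stored directly, only derived by replay."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def dump(self) -> str:
        # Serialize the full history so a session can be restored elsewhere.
        return json.dumps([asdict(e) for e in self._events])

    @classmethod
    def load(cls, raw: str) -> "EventLog":
        log = cls()
        for item in json.loads(raw):
            log.append(Event(**item))
        return log

    def replay(self) -> dict:
        """Fold the events into a small summary state."""
        state = {"messages": 0, "tool_calls": 0, "last": None}
        for e in self._events:
            if e.kind == "message":
                state["messages"] += 1
            elif e.kind == "tool_call":
                state["tool_calls"] += 1
            state["last"] = e.payload
        return state
```

Because the log is the single source of truth, a restored copy replays to the same state as the original, which is what makes long sessions auditable and portable between local and remote runs.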
Where Pith is reading between the lines
- Teams could rapidly prototype new agent workflows that combine execution safety with flexible model choice without building infrastructure from scratch.
- The architecture may support larger-scale deployments where agents interact with real codebases over extended sessions while maintaining auditability.
- Security analysis built into the runtime could surface issues earlier in the development cycle compared with post-hoc checks.
- Model-agnostic routing might allow organizations to switch providers based on cost or capability without re-architecting agent logic.
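The provider-switching point can be made concrete with a small routing sketch. The `Router` class, the provider names, and the fallback-on-exception rule are all assumptions made for this illustration and do not describe how the SDK routes requests; the point is only that agent logic calls one `complete` method and never names a vendor.

```python
from typing import Callable, Dict, List

# A "provider" is any callable from prompt to completion; a real client would
# wrap a vendor API. All names here are hypothetical.
Provider = Callable[[str], str]


class Router:
    def __init__(self, providers: Dict[str, Provider], order: List[str]) -> None:
        self.providers = providers
        self.order = order  # preference order, for example cheapest first

    def complete(self, prompt: str) -> str:
        errors = []
        for name in self.order:
            try:
                return self.providers[name](prompt)
            except Exception as exc:  # fall through to the next provider
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky(prompt: str) -> str:
    raise TimeoutError("simulated outage")


def stable(prompt: str) -> str:
    return f"stable answered: {prompt}"


router = Router({"cheap": flaky, "backup": stable}, order=["cheap", "backup"])
```

Changing the preferred provider is then a matter of editing `order`, with no change to the code that calls `router.complete`.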
Load-bearing premise
The production deployment data and benchmark evaluations are representative and free of selection effects or unmeasured variables that could explain the reported reduction in failures and strong performance.
What would settle it
A controlled production deployment or benchmark rerun in which version 1 shows no reduction in system-attributable failures or fails to maintain strong performance relative to version 0 under matched conditions.
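One way such a matched comparison could be scored is a two-proportion z-test on failure counts from the two versions. The sketch below uses only the standard library; the counts are placeholders invented for illustration and are not figures from the paper, and the test assumes independent runs under matched conditions, which is exactly what the premise above calls into question for observational data.

```python
import math


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> tuple:
    """Return (z, two-sided p-value) for H0: both versions share one failure rate."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled failure rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Placeholder counts, NOT taken from the paper: version 0 as "a", version 1 as "b".
z, p = two_proportion_z(fail_a=50, n_a=1000, fail_b=30, n_b=1000)
```

A small p-value would indicate a difference in failure rates, but attributing it to the redesign still requires that user base, task mix, models, and infrastructure were held fixed.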
Original abstract
Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability and integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VSCode, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. We validate the architecture empirically: production deployment data shows that V1 substantially reduces system-attributable failures over V0 with negligible event-sourcing overhead, and evaluations across multiple models and benchmarks demonstrate strong agent performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the OpenHands Software Agent SDK as a toolkit for implementing software development agents, featuring a simple yet extensible interface, seamless local-to-remote execution, integrated services, and connections to various user interfaces. It positions the SDK as a redesign of the OpenHands framework and claims unique integration of native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis compared to SDKs from OpenAI, Claude, and Google. The architecture is validated empirically through production deployment data showing substantial reduction in system-attributable failures from V0 to V1 with negligible event-sourcing overhead, and through evaluations demonstrating strong agent performance across multiple models and benchmarks.
Significance. If the reported performance improvements and design advantages hold, the OpenHands SDK could provide a significant open-source contribution to the development of reliable and secure software engineering agents. By addressing flexibility, security, and usability in a composable manner, it has the potential to lower barriers for creating custom production agents and influence best practices in the field of AI-assisted software development.
major comments (1)
- [Abstract] The empirical validation sentence asserts that 'production deployment data shows that V1 substantially reduces system-attributable failures over V0 with negligible event-sourcing overhead' without providing quantitative results, error analysis, benchmark details, sample sizes, deployment periods, or controls for confounding factors such as changes in the user base, task distribution, models, or infrastructure. This before-after contrast is central to supporting the claim of a 'practical foundation for ... reliably deploying agents at scale,' but the lack of methodological details makes it difficult to attribute the failure reduction specifically to the V1 redesign rather than other unmeasured variables.
minor comments (1)
- The abstract could more explicitly reference the sections where the production data and benchmark evaluations are detailed to guide readers.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. We address the single major comment below and will revise the manuscript to improve the clarity and support for our empirical claims.
Point-by-point responses
- Referee: [Abstract] The empirical validation sentence asserts that 'production deployment data shows that V1 substantially reduces system-attributable failures over V0 with negligible event-sourcing overhead' without providing quantitative results, error analysis, benchmark details, sample sizes, deployment periods, or controls for confounding factors such as changes in the user base, task distribution, models, or infrastructure. This before-after contrast is central to supporting the claim of a 'practical foundation for ... reliably deploying agents at scale,' but the lack of methodological details makes it difficult to attribute the failure reduction specifically to the V1 redesign rather than other unmeasured variables.
  Authors: We agree that the abstract presents a high-level summary of the production deployment results without the quantitative details, methodological information, or discussion of potential confounders that would allow readers to fully evaluate the attribution to the V1 redesign. The full manuscript contains a dedicated empirical evaluation section that reports the specific failure-rate reductions, event-sourcing overhead measurements, deployment periods, and related analysis. To address the concern, we will revise the abstract to either incorporate key quantitative highlights (subject to length constraints) or explicitly reference the detailed evaluation section, and we will ensure the claims are appropriately qualified with respect to the observational nature of the data and any unmeasured variables.
  revision: yes
Circularity Check
No circularity: software toolkit description with direct empirical validation
Full rationale
The paper presents the OpenHands Software Agent SDK as an architectural redesign with features for flexibility, security, and interaction. Its central claims rest on direct descriptions of the interface, portability, and services, plus empirical statements about production deployment data and benchmark performance. No equations, fitted parameters, predictions, or first-principles derivations appear that could reduce to inputs by construction. Self-citations are absent from the provided text, and the validation sentences report observed outcomes rather than renaming or smuggling prior results. The derivation chain is therefore self-contained as a software engineering contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard assumptions about software agent reliability, security requirements, and execution portability in production environments hold for the described architecture.
Forward citations
Cited by 10 Pith papers
- Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
  DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
- SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents
  SkCC compiles LLM skills via SkIR to achieve portability across agent frameworks, reduce adaptation effort from O(m×n) to O(m+n), and enforce security with reported gains in task success rates and token efficiency.
- SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents
  SkCC compiles LLM agent skills through a strongly-typed IR and static security checks, cutting adaptation complexity from O(m×n) to O(m+n) and raising pass rates by 12-13 points on tested platforms.
- ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
  ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...
- EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
  EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
- Code as Agent Harness
  A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...
- Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
  Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
  LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
- Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development
  Agentic Agile-V uses Agile-V as backbone and a Specify-Constrain-Orchestrate-Prove-Evolve-Verify loop to convert AI agent conversations into traceable engineering artifacts with acceptance evidence.
- Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models
  A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.