arxiv: 2511.03690 · v2 · submitted 2025-11-05 · 💻 cs.SE · cs.AI

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Pith reviewed 2026-05-18 00:52 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords OpenHands · Software Agent SDK · production agents · sandboxed execution · multi-LLM routing · agent framework · software development agents · lifecycle control

The pith

OpenHands SDK delivers a composable agent interface that combines sandboxed execution, lifecycle control, and model-agnostic LLM routing for production software agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the OpenHands Software Agent SDK as a complete architectural redesign of the agent components in the OpenHands framework. It aims to meet the needs of production software development agents through a minimal yet extensible interface that supports custom tools and memory, seamless local-to-remote execution portability, and direct connections to interfaces such as VSCode, VNC, command lines, and APIs. The SDK is positioned as distinct from existing offerings by integrating native sandboxed execution, built-in security analysis, and multi-LLM routing without being tied to a single model provider. Empirical validation shows version 1 reduces system-attributable failures compared with version 0 while adding negligible overhead, and benchmark results indicate strong performance across models.

Core claim

The OpenHands Software Agent SDK supplies a practical foundation for prototyping and scaling software agents by offering a simple default interface that extends to full-featured implementations, integrated REST and WebSocket services for reliability, and user-facing connections to visual workspaces and APIs, all while delivering measurable reductions in production failures and consistent benchmark performance.

What carries the argument

A minimal agent interface that requires only a few lines of code in the base case yet supports extensibility through custom tools, memory management, and model-agnostic multi-LLM routing, paired with native sandboxed execution and lifecycle control.
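
To make the "few lines in the base case, extensible through custom tools" idea concrete, here is a minimal Python sketch. Every name in it (`Tool`, `MiniAgent`, `stub_llm`) is hypothetical and chosen for illustration; none of it is the OpenHands SDK API, and the model is replaced by a plain function so the example runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not the OpenHands SDK API.
LLM = Callable[[str], Tuple[str, str]]  # prompt -> (tool name, tool argument)


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]


@dataclass
class MiniAgent:
    llm: LLM
    tools: Dict[str, Tool] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)  # simple in-memory record

    def register(self, tool: Tool) -> None:
        """Add a custom tool without touching the core loop."""
        self.tools[tool.name] = tool

    def step(self, task: str) -> str:
        """Ask the LLM which tool to call, run it, and record the outcome."""
        tool_name, arg = self.llm(task)
        if tool_name not in self.tools:
            result = f"unknown tool: {tool_name}"
        else:
            result = self.tools[tool_name].run(arg)
        self.history.append(f"{tool_name}({arg!r}) -> {result}")
        return result


def stub_llm(prompt: str) -> Tuple[str, str]:
    """Stand-in for a real model: always asks to upper-case the prompt."""
    return ("upper", prompt)


# Base case: a working agent in a few lines.
agent = MiniAgent(llm=stub_llm)
agent.register(Tool(name="upper", run=str.upper))
```

The design point this sketch tries to capture is that the loop in `step` never changes; new behavior arrives only through `register`, which is one plausible reading of "extensible without rewriting core components".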

If this is right

  • Agents can be implemented with minimal code and extended for custom tools or memory without rewriting core components.
  • Execution moves seamlessly between local and remote environments while preserving security through sandboxing.
  • Agents connect directly to user interfaces including IDEs, browsers, and command lines without additional adapters.
  • Production deployments experience fewer system-attributable failures after the redesign with only negligible event-sourcing cost.
  • Performance remains strong when evaluated across multiple models and standard software-agent benchmarks.
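
The "event-sourcing cost" mentioned in the list refers to the general pattern of recording every state change as an append-only event and rebuilding state by replaying the log. The sketch below illustrates that pattern only; the event kinds and the `EventLog` class are assumptions made for this example, since the SDK's actual event model is not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    kind: str     # hypothetical kinds: "message", "tool_call", "tool_result"
    payload: str


class EventLog:
    """Append-only log; derived state is always recomputed from the events."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> Dict[str, int]:
        """Rebuild derived state (here, a count per event kind) from scratch."""
        state: Dict[str, int] = {}
        for event in self._events:
            state[event.kind] = state.get(event.kind, 0) + 1
        return state

    def __len__(self) -> int:
        return len(self._events)


log = EventLog()
log.append(Event("message", "fix the failing test"))
log.append(Event("tool_call", "pytest -q"))
log.append(Event("tool_result", "1 failed"))
log.append(Event("tool_call", "pytest -q"))
```

Because nothing is mutated in place, a session can in principle be audited or resumed by replaying the same log, at the price of the extra writes that the paper reports as negligible.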

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could rapidly prototype new agent workflows that combine execution safety with flexible model choice without building infrastructure from scratch.
  • The architecture may support larger-scale deployments where agents interact with real codebases over extended sessions while maintaining auditability.
  • Security analysis built into the runtime could surface issues earlier in the development cycle compared with post-hoc checks.
  • Model-agnostic routing might allow organizations to switch providers based on cost or capability without re-architecting agent logic.
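
As an illustration of the last point, a router can pick a provider from declared cost and capability without the agent logic knowing which one was chosen. This is a sketch under stated assumptions: `Provider`, `route`, the provider names, prices, and context sizes are all invented for the example and do not describe any real provider or the SDK's router.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Provider:
    name: str
    cost_per_1k_tokens: float       # made-up price, for illustration
    max_context_tokens: int         # made-up capability limit
    complete: Callable[[str], str]  # stubbed completion function


def route(providers: List[Provider], prompt_tokens: int) -> Provider:
    """Return the cheapest provider whose context window fits the prompt."""
    eligible = [p for p in providers if p.max_context_tokens >= prompt_tokens]
    if not eligible:
        raise ValueError("no provider can handle a prompt of this size")
    return min(eligible, key=lambda p: p.cost_per_1k_tokens)


providers = [
    Provider("small-model", 0.1, 8_000, lambda prompt: "small: " + prompt),
    Provider("large-model", 1.0, 200_000, lambda prompt: "large: " + prompt),
]

# Agent code calls the chosen provider through the same interface either way.
chosen = route(providers, prompt_tokens=1_000)
reply = chosen.complete("summarize the diff")
```

Swapping providers then means editing the `providers` list, not the code that calls `route`.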

Load-bearing premise

The production deployment data and benchmark evaluations are representative and free of selection effects or unmeasured variables that could explain the reported reduction in failures and strong performance.

What would settle it

A controlled production deployment or benchmark rerun in which version 1 shows no reduction in system-attributable failures or fails to maintain strong performance relative to version 0 under matched conditions.
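
One simple way to check whether a V0-versus-V1 difference in failure rate exceeds chance is a two-proportion z-test. The sketch below is illustrative only: the counts are hypothetical and are not taken from the paper, and a significance test does not by itself remove confounding, so the "matched conditions" requirement still applies.

```python
import math
from typing import Tuple


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in failure rates a - b."""
    p_a = fail_a / n_a
    p_b = fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 50 failures in 1000 V0 sessions, 30 in 1000 V1 sessions.
z, p_value = two_proportion_z(fail_a=50, n_a=1000, fail_b=30, n_b=1000)
```

A positive `z` here means V0 failed more often than V1; a rerun that yields `z` near zero under matched conditions would count against the paper's claim.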

Original abstract

Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex full-featured agents with features such as custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability, integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VSCode, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. We validate the architecture empirically: production deployment data shows that V1 substantially reduces system-attributable failures over V0 with negligible event-sourcing overhead, and evaluations across multiple models and benchmarks demonstrate strong agent performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents the OpenHands Software Agent SDK as a toolkit for implementing software development agents, featuring a simple yet extensible interface, seamless local-to-remote execution, integrated services, and connections to various user interfaces. It positions the SDK as a redesign of the OpenHands framework and claims unique integration of native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis compared to SDKs from OpenAI, Claude, and Google. The architecture is validated empirically through production deployment data showing substantial reduction in system-attributable failures from V0 to V1 with negligible event-sourcing overhead, and through evaluations demonstrating strong agent performance across multiple models and benchmarks.

Significance. If the reported performance improvements and design advantages hold, the OpenHands SDK could provide a significant open-source contribution to the development of reliable and secure software engineering agents. By addressing flexibility, security, and usability in a composable manner, it has the potential to lower barriers for creating custom production agents and influence best practices in the field of AI-assisted software development.

major comments (1)
  1. [Abstract] The empirical validation sentence asserts that 'production deployment data shows that V1 substantially reduces system-attributable failures over V0 with negligible event-sourcing overhead' without providing quantitative results, error analysis, benchmark details, sample sizes, deployment periods, or controls for confounding factors such as changes in the user base, task distribution, models, or infrastructure. This before-after contrast is central to supporting the claim of a 'practical foundation for ... reliably deploying agents at scale,' but the lack of methodological details makes it difficult to attribute the failure reduction specifically to the V1 redesign rather than other unmeasured variables.
minor comments (1)
  1. The abstract could more explicitly reference the sections where the production data and benchmark evaluations are detailed to guide readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address the single major comment below and will revise the manuscript to improve the clarity and support for our empirical claims.

Point-by-point responses
  1. Referee: [Abstract] The empirical validation sentence asserts that 'production deployment data shows that V1 substantially reduces system-attributable failures over V0 with negligible event-sourcing overhead' without providing quantitative results, error analysis, benchmark details, sample sizes, deployment periods, or controls for confounding factors such as changes in the user base, task distribution, models, or infrastructure. This before-after contrast is central to supporting the claim of a 'practical foundation for ... reliably deploying agents at scale,' but the lack of methodological details makes it difficult to attribute the failure reduction specifically to the V1 redesign rather than other unmeasured variables.

    Authors: We agree that the abstract presents a high-level summary of the production deployment results without the quantitative details, methodological information, or discussion of potential confounders that would allow readers to fully evaluate the attribution to the V1 redesign. The full manuscript contains a dedicated empirical evaluation section that reports the specific failure-rate reductions, event-sourcing overhead measurements, deployment periods, and related analysis. To address the concern, we will revise the abstract to either incorporate key quantitative highlights (subject to length constraints) or explicitly reference the detailed evaluation section, and we will ensure the claims are appropriately qualified with respect to the observational nature of the data and any unmeasured variables.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: software toolkit description with direct empirical validation

Full rationale

The paper presents the OpenHands Software Agent SDK as an architectural redesign with features for flexibility, security, and interaction. Its central claims rest on direct descriptions of the interface, portability, and services, plus empirical statements about production deployment data and benchmark performance. No equations, fitted parameters, predictions, or first-principles derivations appear that could reduce to inputs by construction. Self-citations are absent from the provided text, and the validation sentences report observed outcomes rather than renaming or smuggling prior results. The derivation chain is therefore self-contained as a software engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is a software engineering toolkit paper. No free parameters, mathematical axioms, or new invented entities are introduced. The contribution rests on standard domain assumptions about agent execution environments and security needs in production software development.

axioms (1)
  • Domain assumption: Standard assumptions about software agent reliability, security requirements, and execution portability in production environments hold for the described architecture.
    Invoked implicitly when claiming reduced failures and practical foundation status in the abstract validation paragraph.

pith-pipeline@v0.9.0 · 5858 in / 1352 out tokens · 49894 ms · 2026-05-18T00:52:17.912164+00:00 · methodology

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  2. SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    SkCC compiles LLM skills via SkIR to achieve portability across agent frameworks, reduce adaptation effort from O(m×n) to O(m+n), and enforce security with reported gains in task success rates and token efficiency.

  3. SkCC: Portable and Secure Skill Compilation for Cross-Framework LLM Agents

    cs.CR 2026-05 unverdicted novelty 6.0

    SkCC compiles LLM agent skills through a strongly-typed IR and static security checks, cutting adaptation complexity from O(m×n) to O(m+n) and raising pass rates by 12-13 points on tested platforms.

  4. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  5. EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

    cs.CL 2026-02 conditional novelty 6.0

    EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.

  6. Code as Agent Harness

    cs.CL 2026-05 accept novelty 5.0

    A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed ...

  7. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

    cs.AI 2026-05 unverdicted novelty 5.0 partial

    Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.

  8. Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering

    cs.SE 2026-04 accept novelty 5.0

    LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.

  9. Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development

    cs.SE 2026-05 unverdicted novelty 4.0

    Agentic Agile-V uses Agile-V as backbone and a Specify-Constrain-Orchestrate-Prove-Evolve-Verify loop to convert AI agent conversations into traceable engineering artifacts with acceptance evidence.

  10. Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.
