DockSmith: Scaling Reliable Coding Environments via an Agentic Docker Builder

2026 · cs.AI · arXiv 2602.00592

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Reliable Docker-based environment construction is a dominant bottleneck for scaling execution-grounded training and evaluation of software engineering agents. We introduce DockSmith, a specialized agentic Docker builder designed to address this challenge. DockSmith treats environment construction not only as a preprocessing step, but as a core agentic capability that exercises long-horizon tool use, dependency reasoning, and failure recovery, yielding supervision that transfers beyond Docker building itself. DockSmith is trained on large-scale, execution-grounded Docker-building trajectories produced by a SWE-Factory-style pipeline augmented with a loop-detection controller and a cross-task success memory. Training a 30B-A3B model on these trajectories achieves open-source state-of-the-art performance on Multi-Docker-Eval, with 39.72% Fail-to-Pass and 58.28% Commit Rate. Moreover, DockSmith improves out-of-distribution performance on SWE-bench Verified, SWE-bench Multilingual, and Terminal-Bench 2.0, demonstrating broader agentic benefits of environment construction.

representative citing papers

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

cs.SE · 2026-06-03 · unverdicted · novelty 7.0

DeployBench is a new benchmark of 51 research-artifact deployment tasks where four LLMs with OpenHands achieve 7.8-51% pass rates, with failures mostly from agents stopping after weaker self-checks than the paper requires.

citing papers explorer

Showing 1 of 1 citing paper.

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment cs.SE · 2026-06-03 · unverdicted · none · ref 27 · internal anchor
