Dockerless: Environment-Free Program Verifier for Coding Agents
Pith reviewed 2026-06-30 01:30 UTC · model grok-4.3
The pith
Dockerless judges whether code patches are correct through agentic repository exploration, without executing them or building Docker environments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dockerless is an environment-free agentic patch verifier that evaluates generated code patches without executing them or matching them to references; instead it judges correctness from evidence gathered through agentic repository exploration. When the same verifier is applied both as the SFT trajectory filter and as the RL reward signal, it produces a fully environment-free post-training pipeline whose model reaches 62.0 percent, 50.0 percent, and 35.2 percent resolve rate on SWE-bench Verified, Multilingual, and Pro respectively, exceeding the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points and matching the results of environment-based post-training.
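The dual use of one verifier described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Trajectory` type, the `(issue, patch) -> score` signature, the 0.5 threshold, and `toy_verifier` are all hypothetical stand-ins, and the real verifier would explore the repository rather than inspect a string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical container: one agent attempt at one issue."""
    issue: str
    patch: str


# Assumed verifier interface: (issue, patch) -> score in [0, 1].
Verifier = Callable[[str, str], float]


def filter_for_sft(trajs: List[Trajectory], verify: Verifier,
                   threshold: float = 0.5) -> List[Trajectory]:
    """Keep only trajectories whose patch the verifier accepts."""
    return [t for t in trajs if verify(t.issue, t.patch) >= threshold]


def rl_rewards(trajs: List[Trajectory], verify: Verifier) -> List[float]:
    """Use the same verifier score as the per-trajectory RL reward."""
    return [verify(t.issue, t.patch) for t in trajs]


def toy_verifier(issue: str, patch: str) -> float:
    # Stand-in only: NOT Dockerless. It just checks for a keyword.
    return 1.0 if "fix" in patch else 0.0


trajs = [Trajectory("crash on None", "fix: guard None input"),
         Trajectory("crash on None", "noop")]
sft_set = filter_for_sft(trajs, toy_verifier)
rewards = rl_rewards(trajs, toy_verifier)
```

The point of the sketch is that neither function touches a test runner or a container: both stages depend only on the verifier callable, which is what makes the pipeline environment-free.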
What carries the argument
Dockerless, an agentic patch verifier that gathers evidence through repository exploration to judge patch correctness without execution.
If this is right
- A fully environment-free post-training pipeline becomes possible for coding agents.
- The verifier can be used simultaneously for SFT trajectory filtering and RL reward assignment.
- Performance on SWE-bench Verified, Multilingual, and Pro matches that of execution-based training.
- Dockerless outperforms the strongest open-source verifier by 14.3 AUC points on a dedicated verifier benchmark.
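For readers unfamiliar with the AUC figure, the sketch below shows one standard way to compute ROC AUC from verifier scores and ground-truth labels, using the pairwise-ranking definition. The scores and labels are made-up toy values, and reading "14.3 AUC points" as a difference of 0.143 on the 0 to 1 scale is an assumption, since the benchmark details are not given here.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a correct patch outscores an incorrect one.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: label 1 = patch passes held-out tests, 0 = it fails.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)  # 3 of 4 pairs ranked correctly
```

AUC is threshold-free, so it measures how well the verifier ranks patches rather than how well any particular accept/reject cutoff performs.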
Where Pith is reading between the lines
- The same exploration-based judgment approach could be tested on verification tasks outside software, such as formal proof checking.
- If exploration evidence proves reliable, training pipelines could drop unit-test requirements entirely for some domains.
- Larger-scale repository exploration might improve verification accuracy on very large codebases where test coverage is incomplete.
- The method opens the possibility of mixing Dockerless signals with lightweight static analysis to reduce any remaining judgment errors.
Load-bearing premise
Evidence gathered by agentic repository exploration is sufficient to determine patch correctness without any execution or test running.
What would settle it
A large set of patches whose correctness judgments from Dockerless disagree with the outcomes of actual unit-test execution on the same patches.
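Such a check could be tallied roughly as below. This is a sketch under stated assumptions: both verdict lists are hypothetical booleans (True = patch judged or observed correct), and the category names are illustrative rather than taken from the paper.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def compare_verdicts(verifier_verdicts: Sequence[bool],
                     test_outcomes: Sequence[bool]
                     ) -> Tuple[Dict[str, int], float]:
    """Tally agreement between verifier verdicts and unit-test outcomes."""
    if len(verifier_verdicts) != len(test_outcomes):
        raise ValueError("verdict lists must have equal length")
    counts: Counter = Counter()
    for v, t in zip(verifier_verdicts, test_outcomes):
        if v and t:
            counts["agree_pass"] += 1
        elif not v and not t:
            counts["agree_fail"] += 1
        elif v and not t:
            counts["false_accept"] += 1   # verifier accepts, tests fail
        else:
            counts["false_reject"] += 1   # verifier rejects, tests pass
    total = sum(counts.values())
    wrong = counts["false_accept"] + counts["false_reject"]
    return dict(counts), (wrong / total if total else 0.0)


# Toy verdicts for five patches.
verifier = [True, True, False, False, True]
tests = [True, False, False, True, True]
tally, disagreement_rate = compare_verdicts(verifier, tests)
```

False accepts are the more damaging category for training, since they would reward or retain patches that execution would reject.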
Original abstract
Program verifiers play a central role in training coding agents, including selecting trajectories for supervised fine-tuning (SFT) and providing rewards for reinforcement learning (RL). Standard execution-based verification requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs. We propose Dockerless, an environment-free agentic patch verifier that evaluates generated code patches without executing them. Rather than simply matching candidate patches to references, Dockerless judges patch correctness using evidence gathered through agentic repository exploration. On a verifier evaluation benchmark, Dockerless outperforms the strongest open-source verifier by 14.3 AUC points. Using Dockerless as both the SFT trajectory filter and the RL reward enables a fully environment-free post-training pipeline. The resulting model reaches 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively. It surpasses the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points, matching environment-based post-training.
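The baseline resolve rates are not stated on this page, but they follow from the reported numbers by subtraction. The short calculation below assumes the gains are absolute percentage points; the dictionary names are illustrative.

```python
# Reported resolve rates (%) and gains over the Qwen3.5-9B baseline (points).
reported = {"Verified": 62.0, "Multilingual": 50.0, "Pro": 35.2}
gain = {"Verified": 2.4, "Multilingual": 8.7, "Pro": 2.9}

# Implied baseline = reported rate minus gain, rounded to one decimal
# to avoid floating-point noise.
implied_baseline = {k: round(reported[k] - gain[k], 1) for k in reported}

for split, value in implied_baseline.items():
    print(f"{split}: {value}%")
```

Under that assumption the implied baselines are 59.6%, 41.3%, and 32.3% on Verified, Multilingual, and Pro, which shows that the Multilingual split accounts for most of the improvement.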
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Dockerless, an agentic, environment-free program verifier for code patches. It judges patch correctness via evidence from repository exploration rather than execution or test running. Dockerless is reported to outperform the strongest open-source verifier by 14.3 AUC points on a verifier benchmark. Using it for both SFT trajectory filtering and RL rewards produces a post-training pipeline that yields 62.0%, 50.0%, and 35.2% resolve rates on SWE-bench Verified, Multilingual, and Pro, surpassing the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points respectively and matching environment-based post-training.
Significance. If the verifier's judgments prove reliable proxies for actual patch correctness, the work would be significant for scaling coding-agent post-training. Removing per-repository Docker setup and execution costs could lower barriers to large-scale SFT/RL pipelines while preserving performance, which is a practical advance for the field.
major comments (2)
- [Dockerless verifier description and post-training experiments] The central claim that Dockerless enables environment-free post-training matching execution-based results rests on the assumption that agentic repository exploration alone suffices to determine patch correctness. The manuscript provides no analysis or ablation showing that this approach captures failure modes (e.g., runtime behavior, unvisited paths, or environment-specific interactions) that execution-based verification would detect; without such evidence the reported resolve-rate gains could reflect verifier artifacts rather than genuine quality.
- [Verifier evaluation section] The verifier benchmark result (14.3 AUC gain) is presented without accompanying details on dataset construction, baseline implementations, controls for exploration depth, or error analysis of false positives/negatives. This makes it impossible to determine whether the AUC improvement generalizes or is load-bearing for the downstream SFT/RL claims.
minor comments (2)
- [Abstract] The abstract states concrete performance numbers (AUC gain, resolve rates) with no accompanying method or dataset summary; this should be expanded for readability even if details appear later.
- [Method] Notation for the agentic exploration process (e.g., how evidence is aggregated into a correctness judgment) is introduced without a clear algorithmic pseudocode or formal definition.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing to revisions that strengthen the presentation of evidence while defending the empirical results as reported.
Point-by-point responses
- Referee: [Dockerless verifier description and post-training experiments] The central claim that Dockerless enables environment-free post-training matching execution-based results rests on the assumption that agentic repository exploration alone suffices to determine patch correctness. The manuscript provides no analysis or ablation showing that this approach captures failure modes (e.g., runtime behavior, unvisited paths, or environment-specific interactions) that execution-based verification would detect; without such evidence the reported resolve-rate gains could reflect verifier artifacts rather than genuine quality.
  Authors: We acknowledge that the manuscript does not contain explicit ablations isolating failure modes such as runtime behavior, unvisited paths, or environment-specific interactions. The reported resolve rates (62.0/50.0/35.2%) are shown to match environment-based post-training on the three SWE-bench splits, which provides indirect support for the verifier's utility, but we agree this does not substitute for targeted analysis. In the revision we will add a dedicated limitations subsection with qualitative examples of cases where repository exploration may miss execution-dependent errors, while retaining the empirical matching results as evidence of practical equivalence for the evaluated benchmarks.
  Revision: partial
- Referee: [Verifier evaluation section] The verifier benchmark result (14.3 AUC gain) is presented without accompanying details on dataset construction, baseline implementations, controls for exploration depth, or error analysis of false positives/negatives. This makes it impossible to determine whether the AUC improvement generalizes or is load-bearing for the downstream SFT/RL claims.
  Authors: We agree that the verifier evaluation section requires additional methodological detail to support the 14.3 AUC claim and its connection to the post-training results. The revised manuscript will expand this section to describe the benchmark dataset construction process, the exact baseline verifier implementations and hyperparameters, controls and sensitivity analysis for exploration depth, and a breakdown of false-positive and false-negative cases with representative examples.
  Revision: yes
Circularity Check
No circularity: empirical pipeline with external benchmarks
Full rationale
The paper describes an agentic verifier (Dockerless) whose correctness judgments are produced by repository exploration rather than execution. It reports an AUC improvement on a verifier benchmark and downstream resolve rates on SWE-bench variants that are measured by the standard execution-based protocol. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The central claim (environment-free training matches environment-based results) is an empirical comparison against external benchmarks and baselines, not a reduction to the verifier's own outputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Agentic repository exploration can gather sufficient evidence to judge patch correctness without execution.