DeployBench: Benchmarking LLM Agents for Research Artifact Deployment
Pith reviewed 2026-06-28 05:32 UTC · model grok-4.3
The pith
LLM agents achieve pass rates of 7.8 to 51 percent when deploying research artifacts across 51 tasks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DeployBench consists of 51 research-artifact deployment tasks, each verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. When four state-of-the-art LLMs are evaluated with OpenHands, pass rates range from 7.8 percent to 51.0 percent. Failures are dominated by a completion-judgment problem: 97 of the 154 failures are agent-terminated self-stops, in which the agent's pre-finish checks validate a different or weaker target than the paper-specific task requires.
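The reported percentages can be sanity-checked with a few lines of arithmetic. The sketch below assumes each model attempted all 51 tasks exactly once, so the per-model pass counts it derives are inferred from the reported rates and are not quoted from the paper. The helper names are illustrative.

```python
TASKS = 51  # number of benchmark tasks, from the abstract


def passes_for_rate(rate_percent: float, tasks: int = TASKS) -> int:
    """Pass count whose rate is closest to a reported percentage."""
    return round(rate_percent / 100 * tasks)


def rate(part: int, total: int) -> float:
    """Percentage rounded to one decimal place, as the paper reports it."""
    return round(100 * part / total, 1)


# Inferred counts (an assumption: one run per task per model).
low_passes = passes_for_rate(7.8)
high_passes = passes_for_rate(51.0)

# Share of failures that are agent-terminated self-stops.
self_stop_share = rate(97, 154)

print(low_passes, rate(low_passes, TASKS))
print(high_passes, rate(high_passes, TASKS))
print(self_stop_share)
```

Under that assumption the weakest and strongest models pass 4 and 26 of the 51 tasks, and self-stops account for about 63 percent of failures.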
What carries the argument
The hidden verification pipeline that executes the paper's designated experiment and checks its outputs to determine whether deployment succeeded.
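Because the pipeline is not published, its structure can only be guessed at. The following is a minimal sketch of what an execute-then-check verifier could look like, not the authors' implementation. `TaskSpec`, `verify`, the output file name, and the toy commands are all hypothetical.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class TaskSpec:
    """Hypothetical description of one task's hidden check."""
    command: list[str]            # the paper's designated experiment
    output_file: str              # file that experiment is expected to write
    check: Callable[[str], bool]  # validator applied to the file contents
    timeout_s: int = 3600


def verify(workdir: Path, spec: TaskSpec) -> bool:
    """Run the designated experiment in workdir and validate its output."""
    try:
        proc = subprocess.run(
            spec.command, cwd=workdir, capture_output=True,
            text=True, timeout=spec.timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    if proc.returncode != 0:
        return False
    out = workdir / spec.output_file
    return out.is_file() and spec.check(out.read_text())


# Toy "experiment": a one-line Python command that writes a result file.
spec = TaskSpec(
    command=[sys.executable, "-c",
             "from pathlib import Path; "
             "Path('result.txt').write_text('accuracy=0.91')"],
    output_file="result.txt",
    check=lambda text: text.startswith("accuracy="),
)
with tempfile.TemporaryDirectory() as tmp:
    passed = verify(Path(tmp), spec)

# A weaker check (an import-style smoke test) exits cleanly but never
# produces the file the task actually requires, so it is rejected.
weak = TaskSpec([sys.executable, "-c", "print('import ok')"],
                "result.txt", spec.check)
with tempfile.TemporaryDirectory() as tmp:
    weak_passed = verify(Path(tmp), weak)

print(passed, weak_passed)
```

The second case mirrors the self-stop pattern the paper describes: the agent's own check succeeds, yet the paper-specific target was never exercised.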
If this is right
- Agents must develop more precise pre-termination checks that align with paper-specific experimental requirements rather than weaker internal criteria.
- Successful deployment requires managing system-level dependencies such as GPU and CUDA configurations in addition to code-level setup.
- Pass rates remain low even for current leading models, indicating that autonomous research-artifact deployment is not yet reliable across the tested domains.
- The benchmark supplies a concrete testbed that can track progress as agent judgment and environment-handling capabilities improve.
Where Pith is reading between the lines
- If the judgment failures can be reduced, overall success rates on similar deployment tasks could increase markedly.
- The same self-stop pattern may limit agent performance on other multi-step benchmarks that involve hidden or paper-specific success criteria.
- Adding tasks from additional research domains would test whether the observed failure distribution holds beyond the current 51 tasks.
Load-bearing premise
The hidden pipelines accurately reproduce the original papers' intended experiments and produce correct pass/fail signals.
What would settle it
Running each hidden pipeline on an artifact that has been manually deployed according to the original paper and confirming whether the pipeline accepts or rejects that deployment.
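That audit can be sketched as a loop over known-good deployments. The function below is an illustration under stated assumptions: `audit_pipelines`, the task ids, the paths, and the stand-in pipeline are hypothetical, and a real audit would call the actual hidden pipeline.

```python
from typing import Callable, Mapping


def audit_pipelines(
    reference_deployments: Mapping[str, str],
    run_pipeline: Callable[[str, str], bool],
) -> list[str]:
    """Return ids of tasks whose pipeline rejects a known-good deployment."""
    return [
        task_id
        for task_id, deployment_path in reference_deployments.items()
        if not run_pipeline(task_id, deployment_path)
    ]


# Toy stand-ins: ids, paths, and the fake pipeline results are made up.
references = {"task-a": "/refs/task-a", "task-b": "/refs/task-b"}
fake_results = {"task-a": True, "task-b": False}
rejected = audit_pipelines(
    references, lambda task_id, _path: fake_results[task_id]
)

print(rejected)
```

Any task id returned here would indicate a pipeline that produces false negatives, which would inflate the reported failure counts.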
Original abstract
LLM agents have made rapid progress on software engineering and ML research tasks, but these advances often assume access to a working runnable environment. For research artifacts released alongside published papers, setting up such an environment from a fresh machine remains a major bottleneck. Existing environment setup benchmarks do not cover the full scope of research artifact deployment, which involves multi-language toolchains, system-level dependencies beyond containers (e.g., GPU/CUDA and kernel configurations), and legacy artifact compatibility. We introduce DeployBench, a multi-domain benchmark of 51 research-artifact deployment tasks spanning AI/ML, computer systems, and scientific computing, covering all these dimensions. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 are agent-terminated self-stops, where the agent's pre-finish checks validate a different or weaker target than the paper-specific task requires. DeployBench highlights the gap between current agents and autonomous deployment, and offers a realistic testbed for scientific research agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeployBench, a benchmark of 51 research-artifact deployment tasks spanning AI/ML, computer systems, and scientific computing. Each task is verified by a hidden pipeline that executes the paper's designated experiment and checks its outputs. Evaluating four state-of-the-art LLMs with OpenHands yields pass rates from 7.8% to 51.0%. Failures are dominated by a completion-judgment problem: 97 of 154 failures are agent-terminated self-stops in which the agent validates a different or weaker target than the task requires.
Significance. If the verification pipeline faithfully reproduces the original papers' experiments and the task set is representative, DeployBench would offer a valuable, realistic testbed for assessing LLM agents on complex, multi-language deployment scenarios that existing benchmarks overlook. The reported pass rates and failure-mode breakdown could usefully highlight gaps in current agent capabilities for autonomous research-artifact setup.
major comments (3)
- [Abstract] All numeric results (pass rates 7.8–51.0%, 97/154 self-stop failures) are defined relative to judgments from an uninspectable hidden verification pipeline. No description, code, example traces, or implementation details of this pipeline are supplied, so the grounding of the central claims cannot be assessed or reproduced.
- [Abstract] The manuscript provides no details on task selection criteria, inclusion/exclusion rules, or domain-balance controls for the 51 tasks. Without these, it is impossible to evaluate whether the benchmark fairly represents the space of research-artifact deployment or whether the reported performance gap is an artifact of task curation.
- [Abstract] No information is given on the number of independent runs per task, statistical significance testing of the pass rates, or controls for agent-prompting variations. These omissions make it difficult to determine whether the observed differences across the four LLMs are robust.
Simulated Author's Rebuttal
Thank you for the constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and reproducibility.
- Referee: [Abstract] All numeric results (pass rates 7.8–51.0%, 97/154 self-stop failures) are defined relative to judgments from an uninspectable hidden verification pipeline. No description, code, example traces, or implementation details of this pipeline are supplied, so the grounding of the central claims cannot be assessed or reproduced.
  Authors: We agree that the manuscript provides insufficient detail on the verification pipeline. The pipeline is kept hidden during agent runs to prevent exploitation of verification logic, but this choice limits external assessment. In revision we will add a methods subsection describing the pipeline architecture, the execution of each paper's designated experiment, output-checking logic, and example verification scripts for three representative tasks (one per domain). We will also outline a controlled-release process for the full pipeline to qualified researchers. These additions will ground the reported pass rates without exposing the benchmark to contamination.
  Revision: yes
- Referee: [Abstract] The manuscript provides no details on task selection criteria, inclusion/exclusion rules, or domain-balance controls for the 51 tasks. Without these, it is impossible to evaluate whether the benchmark fairly represents the space of research-artifact deployment or whether the reported performance gap is an artifact of task curation.
  Authors: We acknowledge the omission of explicit selection criteria. The 51 tasks were assembled to span AI/ML, systems, and scientific computing while covering multi-language toolchains, non-container dependencies, and legacy compatibility. In the revision we will insert a dedicated subsection that states the sourcing process, inclusion/exclusion rules (e.g., public GitHub artifacts with runnable experiments, post-2020 papers), and domain-balance targets. This will allow readers to judge representativeness.
  Revision: yes
- Referee: [Abstract] No information is given on the number of independent runs per task, statistical significance testing of the pass rates, or controls for agent-prompting variations. These omissions make it difficult to determine whether the observed differences across the four LLMs are robust.
  Authors: The current manuscript reports single-run results per model–agent pair, driven by compute cost, and does not include statistical tests or prompting-ablation controls. We will revise the evaluation section to document the exact protocol, note the single-run limitation explicitly, and add a limitations paragraph discussing robustness. Where feasible we will report any repeated runs performed during development; full multi-run statistics and prompting controls are planned for a follow-up release but cannot be retrofitted to the existing data without new experiments.
  Revision: partial
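Even with a single run per task, the sampling uncertainty in a pass rate over 51 tasks can be bounded. The sketch below computes a Wilson score interval, a standard interval for binomial proportions. It is not an analysis from the paper. The pass counts 4 and 26 are inferred from the reported 7.8% and 51.0% under the assumption of 51 attempted tasks per model, and the interval treats tasks as independent trials, which ignores run-to-run variance of the agent itself.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Inferred pass counts for the weakest and strongest model (assumption).
weakest = wilson_interval(4, 51)
strongest = wilson_interval(26, 51)

print(f"weakest:   {weakest[0]:.3f} to {weakest[1]:.3f}")
print(f"strongest: {strongest[0]:.3f} to {strongest[1]:.3f}")
```

Under these assumptions the two extreme intervals do not overlap, so the gap between the best and worst model is unlikely to be sampling noise alone. The same cannot be concluded for the two intermediate models, whose rates are not given in the material reviewed here.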
Circularity Check
No significant circularity; benchmark results are direct empirical measurements
The paper introduces DeployBench as an empirical benchmark consisting of 51 tasks whose success is defined by an external hidden verification pipeline that runs each paper's designated experiment. Reported pass rates (7.8%-51.0%) and failure counts (97/154 self-stops) are presented as direct measurements against those externally specified targets. No equations, fitted parameters, predictions derived from first principles, ansatzes, or uniqueness theorems appear in the provided text. No self-citations are invoked to justify any load-bearing step. The verification pipeline is an uninspectable assumption about task construction, but this is a question of external grounding rather than any reduction of a claimed derivation to its own inputs by construction. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The 51 selected tasks adequately represent the full scope of research artifact deployment challenges across the three domains.
Appendix
The appendix is structured as follows:
- Full list of benchmark source artifacts (Section A1).
- The full agent system prompt used in all runs (Section A2).
- The full diagnostic-agent prompt used to diagnose failed runs (Section A3).
- Failure pattern counts across all-fail and mixed-outcome tasks (Section A4).
- Addi...
Initial inspection
- List the contents of <WORKDIR>.
- Identify: the code folder and the paper PDF.
- Create vm in a new directory vm under <WORKDIR>, ONLY if you must use a VM
Read instructions and infer requirements
- Read the paper PDF to understand: required OS/kernel assumptions, hardware assumptions, and what a minimal smoke test would be (not the full benchmarks).
- Read README / INSTALL / scripts in the code.
- If instructions assume Docker, translate them into native host steps.
Agent skills (optional -- use when helpfu...
Dependency resolution
- Determine all build/runtime dependencies (compilers, libraries, Python/Rust/Go/Java, CUDA, etc.).
- Install with apt when appropriate.
- For language-specific deps:
  - Python: Keep the paper's environment isolated from the agent's own runtime; Create and use a project-specific venv under <WORKDIR>/env/. Do not use uv, conda, or any ...
Build and configure
- Build the artifact as required (e.g., make/cmake/bazel/meson).
- Fix path issues so everything runs when invoked from within <WORKDIR>.
- If the artifact or smoke test requires downloaded models, datasets, or weights: download them and ensure the smoke test can use them. Do not skip downloads needed for a minimal run
Run a simple smoke test
- Execute a minimal check that the setup works (e.g. small demo, or a short run with minimal data). Do NOT run the full paper experiments or long benchmarks
If a VM is needed (only as last resort)
- Explain why native host execution is infeasible.
- Use QEMU/KVM if available; create VM disk under <WORKDIR>/vm/.
- Provide:
  - VM OS image source and checksums if applicable
  - VM config (CPU/RAM/disk) and exact launch command(s)
  - How files are shared between host and VM (e.g., virtiofs/9p/scp) while keeping proje...
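The dependency-resolution step above asks for a project-specific venv under <WORKDIR>/env/. A minimal sketch of that step using Python's standard-library `venv` module follows. The function name `create_project_venv` is hypothetical, and a temporary directory stands in for <WORKDIR>; nothing here is taken from the benchmark's own tooling.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_project_venv(workdir: Path, with_pip: bool = True) -> Path:
    """Create <workdir>/env with the stdlib venv module; return its interpreter."""
    env_dir = workdir / "env"
    venv.EnvBuilder(with_pip=with_pip).create(env_dir)
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


# Demonstration in a throwaway directory standing in for <WORKDIR>.
# with_pip=False keeps the demo fast; a real setup would normally want pip.
with tempfile.TemporaryDirectory() as tmp:
    interpreter = create_project_venv(Path(tmp), with_pip=False)
    interpreter_exists = interpreter.exists()

print(interpreter_exists)
```

Invoking the returned interpreter directly, instead of activating the environment in a shell, keeps the paper's dependencies separate from the agent's own runtime.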