Dockerless: Environment-Free Program Verifier for Coding Agents

Chaofan Wang; Chao Hu; Hongting Zhou; Jianqiao Wangni; Kai Cai; Mengnan Qi; Shilin He; Shuzheng Gao; Wenhao Zeng; Xiaodong Gu

arxiv: 2606.28436 · v1 · pith:F6LKGV45new · submitted 2026-06-26 · 💻 cs.SE · cs.AI


Wenhao Zeng, Yuling Shi, Xiaodong Gu, Chao Hu, Chaofan Wang, Yuhao Cui, Hongting Zhou, Mengnan Qi, Jianqiao Wangni, Zhaojian Yu, Shuzheng Gao, Kai Cai, Shilin He


Pith reviewed 2026-06-30 01:30 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: program verifier · coding agents · patch verification · environment-free · agentic exploration · SWE-bench · post-training · reinforcement learning


The pith

Dockerless judges whether code patches are correct through agentic repository exploration, without executing them or building Docker environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Dockerless as a program verifier that determines whether generated code patches are correct by collecting evidence through agentic exploration of the target repository. This verifier is then used to filter trajectories for supervised fine-tuning and to supply rewards during reinforcement learning. The resulting pipeline trains coding agents without any per-repository environment setup or test execution. On SWE-bench Verified, Multilingual, and Pro the trained model reaches 62.0 percent, 50.0 percent, and 35.2 percent resolve rate, exceeding the Qwen3.5-9B baseline while equaling the performance of environment-based training. A reader would care because the method removes the dominant cost of maintaining Docker images for large-scale agent post-training.

Core claim

Dockerless is an environment-free agentic patch verifier that evaluates generated code patches without executing them. Rather than simply matching candidate patches to references, it judges correctness from evidence gathered through agentic repository exploration. Applied as both the SFT trajectory filter and the RL reward signal, it yields a fully environment-free post-training pipeline. The resulting model reaches resolve rates of 62.0, 50.0, and 35.2 percent on SWE-bench Verified, Multilingual, and Pro respectively, exceeding the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points and matching the results of environment-based post-training.
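The baseline resolve rates are not quoted directly, but they follow from the reported rates and gains by subtraction. The short sketch below does that arithmetic; the derived baseline figures are an inference from the numbers above, not values taken from the paper.

```python
# Resolve rates (percent) and gains over the baseline, as quoted in the text.
reported = {"Verified": 62.0, "Multilingual": 50.0, "Pro": 35.2}
gain = {"Verified": 2.4, "Multilingual": 8.7, "Pro": 2.9}

# Implied baseline = reported rate minus stated gain, rounded to one decimal
# to hide floating-point noise. These are derived values, not quoted ones.
implied_baseline = {name: round(reported[name] - gain[name], 1) for name in reported}

for name, value in implied_baseline.items():
    print(f"{name}: implied Qwen3.5-9B baseline = {value}%")
```

The largest implied gap is on Multilingual, which is also where the stated gain (8.7 points) is largest.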

What carries the argument

Dockerless, an agentic patch verifier that gathers evidence through repository exploration to judge patch correctness without execution.

If this is right

A fully environment-free post-training pipeline becomes possible for coding agents.
The verifier can be used simultaneously for SFT trajectory filtering and RL reward assignment.
Performance on SWE-bench Verified, Multilingual, and Pro matches that of execution-based training.
Dockerless outperforms the strongest open-source verifier by 14.3 AUC points on a dedicated verifier benchmark.
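The dual use of one verifier can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Verifier` interface, the `Trajectory` type, the 0.5 threshold, and the binary reward are all assumptions, and `toy_verifier` is a stand-in that has nothing to do with how Dockerless actually judges a patch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    issue: str  # issue text the agent was asked to resolve
    patch: str  # final patch produced by the agent


# Hypothetical interface: a verifier maps (issue, patch) to a score in [0, 1].
Verifier = Callable[[str, str], float]


def filter_for_sft(trajs: List[Trajectory], verify: Verifier,
                   threshold: float = 0.5) -> List[Trajectory]:
    """Keep only trajectories whose final patch the verifier accepts."""
    return [t for t in trajs if verify(t.issue, t.patch) >= threshold]


def rl_reward(traj: Trajectory, verify: Verifier, threshold: float = 0.5) -> float:
    """Binary reward derived from the same verifier (an assumed reward shape)."""
    return 1.0 if verify(traj.issue, traj.patch) >= threshold else 0.0


def toy_verifier(issue: str, patch: str) -> float:
    """Stand-in scorer for demonstration only; accepts patches mentioning 'fix'."""
    return 1.0 if "fix" in patch else 0.0


trajs = [Trajectory("crash on empty input", "fix: guard empty list"),
         Trajectory("crash on empty input", "rename variable")]
kept = filter_for_sft(trajs, toy_verifier)
rewards = [rl_reward(t, toy_verifier) for t in trajs]
print(len(kept), rewards)
```

Because both functions call the same `verify`, any systematic blind spot in the verifier affects the SFT data and the RL reward in the same direction, which is why the verifier's reliability matters so much for this pipeline.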

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same exploration-based judgment approach could be tested on verification tasks outside software, such as formal proof checking.
If exploration evidence proves reliable, training pipelines could drop unit-test requirements entirely for some domains.
Larger-scale repository exploration might improve verification accuracy on very large codebases where test coverage is incomplete.
The method opens the possibility of mixing Dockerless signals with lightweight static analysis to reduce any remaining judgment errors.

Load-bearing premise

Evidence gathered by agentic repository exploration is sufficient to determine patch correctness without any execution or test running.

What would settle it

A large set of patches whose correctness judgments from Dockerless disagree with the outcomes of actual unit-test execution on the same patches.
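One way to run that check is to tabulate verifier verdicts against unit-test outcomes on the same patches. The sketch below assumes both are available as boolean lists; the function name and the example data are illustrative and do not come from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def compare_verdicts(verifier_ok: List[bool],
                     tests_ok: List[bool]) -> Tuple[Dict[str, int], float]:
    """Tabulate agreement between verifier verdicts and test outcomes.

    Returns the four-way counts and the overall disagreement rate.
    """
    if len(verifier_ok) != len(tests_ok):
        raise ValueError("verdict lists must have the same length")
    counts: Counter = Counter()
    for v, t in zip(verifier_ok, tests_ok):
        if v and t:
            counts["agree_pass"] += 1
        elif v and not t:
            counts["false_accept"] += 1  # verifier accepts, tests fail
        elif not v and t:
            counts["false_reject"] += 1  # verifier rejects, tests pass
        else:
            counts["agree_fail"] += 1
    n = len(tests_ok)
    wrong = counts["false_accept"] + counts["false_reject"]
    return dict(counts), (wrong / n if n else 0.0)


# Made-up verdicts for five patches.
counts, rate = compare_verdicts([True, True, False, False, True],
                                [True, False, False, True, True])
print(counts, rate)
```

False accepts are the more damaging case for training, since they let patches that fail under execution into the SFT set or give them positive reward.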

Original abstract

Program verifiers play a central role in training coding agents, including selecting trajectories for supervised fine-tuning (SFT) and providing rewards for reinforcement learning (RL). Standard execution-based verification requires running unit tests inside per-repository environments such as Docker images, incurring substantial environment setup costs. We propose Dockerless, an environment-free agentic patch verifier that evaluates generated code patches without executing them. Rather than simply matching candidate patches to references, Dockerless judges patch correctness using evidence gathered through agentic repository exploration. On a verifier evaluation benchmark, Dockerless outperforms the strongest open-source verifier by 14.3 AUC points. Using Dockerless as both the SFT trajectory filter and the RL reward enables a fully environment-free post-training pipeline. The resulting model reaches 62.0%, 50.0%, and 35.2% resolve rate on SWE-bench Verified, Multilingual, and Pro, respectively. It surpasses the Qwen3.5-9B baseline by 2.4, 8.7, and 2.9 points, matching environment-based post-training.
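For readers unfamiliar with the metric, the sketch below computes AUC as the probability that a randomly chosen correct patch receives a higher verifier score than a randomly chosen incorrect one, with ties counted as half. The scores and labels are made up, and how the paper builds its benchmark or defines "AUC points" is not stated in the abstract.

```python
from typing import List


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Pairwise (Mann-Whitney) estimate of ROC AUC.

    labels: 1 for patches that pass the held-out tests, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Illustrative verifier scores for four patches.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(auc)  # 0.75
```

The quadratic loop is fine for a benchmark-sized evaluation; a rank-based formulation gives the same value when the number of patches is large.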

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Dockerless replaces execution with agentic exploration for patch verification and claims to match environment-based post-training results, but the reliability of that substitution is the open question.


The main takeaway is that this work removes the Docker setup step from coding-agent post-training by using an agentic verifier instead of running tests. It reports a 14.3-point AUC gain over the strongest open-source verifier on a verifier benchmark, and then shows the resulting model reaching 62.0/50.0/35.2 percent resolve rates on SWE-bench Verified, Multilingual, and Pro, which the abstract describes as matching environment-based post-training.

What is actually new is the specific move from execution or reference matching to evidence collected through repository exploration. The pipeline uses the same verifier for both SFT filtering and RL reward, which is a clean end-to-end demonstration. That part is useful for groups that cannot afford per-repo environments.

The soft spot is the lack of evidence that agentic exploration catches the same failures execution would catch. The abstract gives performance numbers but no breakdown of what kinds of bugs the verifier misses, no comparison of false-negative rates on dynamic or environment-specific issues, and no error analysis on the patches that pass the verifier yet fail when actually run. If those gaps exist in the full paper, the matching resolve rates could be partly an artifact of the verifier's blind spots rather than proof that the signal is interchangeable.

The math and data presentation look standard for an empirical systems paper; nothing in the abstract suggests circularity or invented baselines. The citation pattern is not visible here.

This is for labs training coding agents who want to reduce infrastructure overhead. A reader already working on SWE-bench-style tasks will get the most out of the numbers and the pipeline description. It is worth sending to peer review because the practical claim is clear and the method is falsifiable once the verifier details and failure cases are examined.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Dockerless, an agentic, environment-free program verifier for code patches. It judges patch correctness via evidence from repository exploration rather than execution or test running. Dockerless is reported to outperform the strongest open-source verifier by 14.3 AUC points on a verifier benchmark. Using it for both SFT trajectory filtering and RL rewards produces a post-training pipeline that yields 62.0%, 50.0%, and 35.2% resolve rates on SWE-bench Verified, Multilingual, and Pro, surpassing the Qwen3.5-9B baseline by 2.4–8.7 points and matching environment-based post-training.

Significance. If the verifier's judgments prove reliable proxies for actual patch correctness, the work would be significant for scaling coding-agent post-training. Removing per-repository Docker setup and execution costs could lower barriers to large-scale SFT/RL pipelines while preserving performance, which is a practical advance for the field.

major comments (2)

[Dockerless verifier description and post-training experiments] The central claim that Dockerless enables environment-free post-training matching execution-based results rests on the assumption that agentic repository exploration alone suffices to determine patch correctness. The manuscript provides no analysis or ablation showing that this approach captures failure modes (e.g., runtime behavior, unvisited paths, or environment-specific interactions) that execution-based verification would detect; without such evidence the reported resolve-rate gains could reflect verifier artifacts rather than genuine quality.
[Verifier evaluation section] The verifier benchmark result (14.3 AUC gain) is presented without accompanying details on dataset construction, baseline implementations, controls for exploration depth, or error analysis of false positives/negatives. This makes it impossible to determine whether the AUC improvement generalizes or is load-bearing for the downstream SFT/RL claims.

minor comments (2)

[Abstract] The abstract states concrete performance numbers (AUC gain, resolve rates) with no accompanying method or dataset summary; this should be expanded for readability even if details appear later.
[Method] Notation for the agentic exploration process (e.g., how evidence is aggregated into a correctness judgment) is introduced without a clear algorithmic pseudocode or formal definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, agreeing to revisions that strengthen the presentation of evidence while defending the empirical results as reported.


Referee: [Dockerless verifier description and post-training experiments] The central claim that Dockerless enables environment-free post-training matching execution-based results rests on the assumption that agentic repository exploration alone suffices to determine patch correctness. The manuscript provides no analysis or ablation showing that this approach captures failure modes (e.g., runtime behavior, unvisited paths, or environment-specific interactions) that execution-based verification would detect; without such evidence the reported resolve-rate gains could reflect verifier artifacts rather than genuine quality.

Authors: We acknowledge that the manuscript does not contain explicit ablations isolating failure modes such as runtime behavior, unvisited paths, or environment-specific interactions. The reported resolve rates (62.0/50.0/35.2%) are shown to match environment-based post-training on the three SWE-bench splits, which provides indirect support for the verifier's utility, but we agree this does not substitute for targeted analysis. In the revision we will add a dedicated limitations subsection with qualitative examples of cases where repository exploration may miss execution-dependent errors, while retaining the empirical matching results as evidence of practical equivalence for the evaluated benchmarks.
Revision: partial
Referee: [Verifier evaluation section] The verifier benchmark result (14.3 AUC gain) is presented without accompanying details on dataset construction, baseline implementations, controls for exploration depth, or error analysis of false positives/negatives. This makes it impossible to determine whether the AUC improvement generalizes or is load-bearing for the downstream SFT/RL claims.

Authors: We agree that the verifier evaluation section requires additional methodological detail to support the 14.3 AUC claim and its connection to the post-training results. The revised manuscript will expand this section to describe the benchmark dataset construction process, the exact baseline verifier implementations and hyperparameters, controls and sensitivity analysis for exploration depth, and a breakdown of false-positive and false-negative cases with representative examples.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external benchmarks


The paper describes an agentic verifier (Dockerless) whose correctness judgments are produced by repository exploration rather than execution. It reports an AUC improvement on a verifier benchmark and downstream resolve rates on SWE-bench variants that are measured by the standard execution-based protocol. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or described method. The central claim (environment-free training matches environment-based results) is an empirical comparison against external benchmarks and baselines, not a reduction to the verifier's own outputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based solely on the abstract; no free parameters, invented entities, or explicit axioms are stated. The central assumption that agentic exploration yields reliable correctness signals is treated as a domain assumption.

axioms (1)

Domain assumption: Agentic repository exploration can gather sufficient evidence to judge patch correctness without execution.
This premise is required for the environment-free claim to hold; it is invoked by the description of how Dockerless works.



