Recognition: 2 theorem links
Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
Pith reviewed 2026-05-08 18:05 UTC · model grok-4.3
The pith
A 4B model can match frontier LLMs at agentic terminal execution tasks
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Terminus-4B is a Qwen3-4B model post-trained via SFT and RL with a rubric-based LLM-as-judge reward for agentic terminal execution. It reduces the main agent's token usage by up to 30% compared with the No Subagent baseline, with no impact on agent performance on benchmarks such as SWE-Bench Pro and the internal SWE-Bench C# benchmark. The model also improves metrics showing the main agent relying more on subagent outputs and performing fewer terminal tasks itself, closing the gap to frontier models like Claude Sonnet, Opus, and GPT-5.3-Codex and often exceeding their performance.
What carries the argument
Terminus-4B, a 4B-parameter model post-trained from Qwen3-4B via supervised finetuning and reinforcement learning with rubric-based LLM-as-judge rewards, built to handle terminal-execution subtasks.
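To make the training signal concrete, here is a minimal sketch of a rubric-based LLM-as-judge reward over a terminal-execution trace. The rubric items, the Trace fields, and the judge callable are illustrative assumptions; the paper's actual rubric and judge prompt are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trace:
    task: str          # instruction delegated by the main agent
    transcript: str    # full terminal interaction produced by the subagent
    final_answer: str  # summary returned to the main agent


# Illustrative rubric items; the paper's actual rubric is not published here.
RUBRIC: List[str] = [
    "Commands are syntactically valid and run without avoidable errors.",
    "Verbose output (build logs, test results) is summarized rather than echoed.",
    "The final answer directly resolves the delegated terminal task.",
]


def rubric_reward(trace: Trace, judge: Callable[[str], float]) -> float:
    """Ask an LLM judge to grade each rubric item in [0, 1] and average the
    per-item scores into a scalar reward for the RL step."""
    scores = []
    for criterion in RUBRIC:
        prompt = (
            f"Task: {trace.task}\n"
            f"Transcript:\n{trace.transcript}\n"
            f"Final answer: {trace.final_answer}\n"
            f"Criterion: {criterion}\n"
            "Reply with a single score between 0 and 1."
        )
        scores.append(judge(prompt))  # judge() wraps a call to the judge model
    return sum(scores) / len(scores)
```

In the RL stage this scalar would serve as the per-episode reward; the paper additionally blends execution-quality and final-answer scores with a weight α (the formula is quoted in the Lean theorems section below).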
If this is right
- Main agents delegate more terminal execution work to the subagent and execute fewer commands themselves.
- Overall token usage by the main agent falls by up to 30% while benchmark performance stays the same.
- Small specialized models can close or surpass the performance of frontier models on this narrow agentic task.
- Agent architectures can isolate verbose outputs such as build logs inside smaller, cheaper subagent loops (a minimal delegation sketch follows this list).
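As a rough illustration of the last bullet, the sketch below shows a main agent delegating a terminal task to a small subagent and keeping only the short summary in its own context. run_subagent and the report format are hypothetical stand-ins, not the paper's interface.

```python
from typing import List, Tuple


def run_subagent(task: str) -> Tuple[str, str]:
    """Stand-in for a small terminal-execution model (e.g. a 4B subagent).
    Returns the full verbose log and a short summary for the main agent."""
    verbose_log = f"[many thousands of tokens of build/test output for: {task}]"
    summary = f"'{task}': build green, 2 failing tests fixed."
    return verbose_log, summary


def main_agent_step(context: List[str], task: str) -> List[str]:
    """Delegate a terminal task; only the summary enters the main agent's
    context, so verbose logs never consume its token budget."""
    _verbose_log, summary = run_subagent(task)  # log stays in the subagent loop
    context.append(f"Subagent report: {summary}")
    return context


if __name__ == "__main__":
    ctx = main_agent_step([], "run the C# test suite and repair the build")
    print(ctx)
```

The point is purely architectural: the verbose log is consumed inside the subagent loop and never enters the main agent's token budget.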
Where Pith is reading between the lines
- The same post-training recipe could be repeated for other narrow subtasks such as code search or debugging to create additional efficient components.
- Widespread use of such models would lower the cost of running multi-step coding agents by reducing expensive frontier-model calls.
- Task-specific small models may become standard building blocks that let larger frontier models focus only on high-level planning.
Load-bearing premise
That the rubric-based LLM-as-judge reward produces training signals that generalize beyond the specific benchmarks and do not embed biases from the judge model itself.
What would settle it
A significant drop in success rate or token savings when Terminus-4B is evaluated on fresh terminal execution tasks outside the SWE-Bench family or when the reward judge model is replaced with a different one.
read the original abstract
Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution. This architectural pattern keeps the main agent's context window clean by isolating verbose outputs (e.g. build logs, test results, etc.) within the subagent context. Typically when agents employ subagents for such tasks, they use frontier models as these subagents. In this paper, we investigate whether a finetuned small language model (SLM) can achieve comparable performance to frontier models in the task of agentic terminal execution. We present Terminus-4B, which is a post-trained Qwen3-4B model via Supervised Finetuning (SFT) and Reinforcement Learning (RL) using rubric-based LLM-as-judge reward, specifically for this task. In our extensive evaluation spanning various frontier models, training ablations and main agent configurations, we find that Terminus-4B is able to reduce the token usage of the main agent by up to ~30% compared to the No Subagent baseline with no impact to agent performance on benchmarks like SWE-Bench Pro and our internal SWE-Bench C# benchmark, which tends to be heavy in verbose execution tasks. Furthermore, Terminus-4B improves key metrics showing the main agent relying on the outputs of the subagent and doing fewer terminal execution tasks by itself. We see that our model not only closes the gap between the Vanilla Qwen model and frontier models like Claude Sonnet / Opus / GPT-5.3-Codex, but often even exceeds their performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Terminus-4B, a 4B-parameter model obtained by post-training Qwen3-4B via supervised fine-tuning followed by reinforcement learning that uses a rubric-based LLM-as-judge reward. The central claim is that this model, deployed as a subagent for terminal execution, reduces main-agent token consumption by up to ~30% relative to a no-subagent baseline while preserving or improving performance on SWE-Bench Pro and an internal SWE-Bench C# benchmark, increases main-agent reliance on subagent outputs, and matches or exceeds frontier models (Claude Sonnet/Opus, GPT-5.3-Codex) on the same tasks.
Significance. If the empirical results are robust, the work would demonstrate that small specialized models can replace frontier LLMs for narrow agentic subtasks, yielding measurable efficiency gains in multi-agent coding systems. The rubric-based RL approach, if shown to correlate with objective success, could also inform reward design for other execution-oriented agent loops.
major comments (3)
- [Abstract and Evaluation] Abstract and Evaluation sections: the headline claims of ~30% token reduction, unchanged benchmark performance, and superior subagent metrics are stated without accompanying numerical tables, confidence intervals, statistical tests, or ablation results that isolate the contribution of the rubric-based LLM-as-judge reward versus direct outcome signals (e.g., command success or test-pass rates).
- [Methods and Training] Methods and Training sections: no correlation analysis, cross-judge validation, or human evaluation on held-out terminal traces is reported to establish that the LLM-as-judge rubric produces signals that improve genuine execution quality rather than merely outputs favored by the judge model.
- [Experiments] Experiments section: the evaluation lacks controls for benchmark overfitting or judge-specific bias, such as ablations replacing the rubric reward with outcome-based rewards or testing generalization on tasks outside the training distribution.
minor comments (2)
- [Abstract and Evaluation] The internal SWE-Bench C# benchmark is mentioned repeatedly but never described in terms of size, task distribution, or construction details, hindering reproducibility.
- [Evaluation] Notation for the main-agent versus subagent token counts and reliance metrics could be clarified with an explicit equation or table definition early in the paper (one illustrative form is sketched below).
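One illustrative way to pin down the notation the second minor comment asks for; these definitions are a guess at plausible metrics, not the paper's own.

```python
def token_reduction(main_tokens_with_subagent: int,
                    main_tokens_no_subagent: int) -> float:
    """Fractional drop in the main agent's token usage versus the No Subagent
    baseline; 0.30 would correspond to the ~30% headline figure."""
    return 1.0 - main_tokens_with_subagent / main_tokens_no_subagent


def subagent_reliance(subagent_terminal_calls: int,
                      main_agent_terminal_calls: int) -> float:
    """Share of terminal-execution actions handled by the subagent rather than
    by the main agent (one possible reliance metric)."""
    total = subagent_terminal_calls + main_agent_terminal_calls
    return subagent_terminal_calls / total if total else 0.0


print(token_reduction(36_900, 52_000))   # ~0.29
print(subagent_reliance(14, 3))          # ~0.82
```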
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each of the major comments below, providing clarifications and indicating revisions to the manuscript. Our responses aim to strengthen the empirical rigor of the presented results while maintaining the core contributions of Terminus-4B.
read point-by-point responses
- Referee: Abstract and Evaluation sections: the headline claims of ~30% token reduction, unchanged benchmark performance, and superior subagent metrics are stated without accompanying numerical tables, confidence intervals, statistical tests, or ablation results that isolate the contribution of the rubric-based LLM-as-judge reward versus direct outcome signals (e.g., command success or test-pass rates).
Authors: We appreciate this feedback on the presentation of results. The Evaluation section of the manuscript includes multiple tables reporting exact token-reduction percentages (e.g., 28-32% across configurations), benchmark scores for SWE-Bench Pro and the C# benchmark, and subagent-reliance metrics for Terminus-4B versus baselines and frontier models. However, we acknowledge the absence of confidence intervals and formal statistical tests in the current version. In the revised manuscript, we will add 95% confidence intervals to key metrics and perform paired statistical tests to confirm the significance of the ~30% reduction and of performance parity (a sketch of such a paired analysis appears after these responses). Regarding ablations for the reward signal, the Training and Experiments sections describe comparisons between SFT, RL with rubric rewards, and variants; we will expand these to explicitly include direct outcome-based rewards (e.g., binary success signals) and report their impact on token usage and execution quality. Revision: partial.
- Referee: Methods and Training sections: no correlation analysis, cross-judge validation, or human evaluation on held-out terminal traces is reported to establish that the LLM-as-judge rubric produces signals that improve genuine execution quality rather than merely outputs favored by the judge model.
Authors: We agree that validating the LLM-as-judge signal is crucial for the RL component. The Methods section details the rubric criteria, which are designed to align with objective execution outcomes such as command success and error resolution. To address the lack of correlation analysis, we will include in the revision a correlation study on 150 held-out terminal execution traces, computing Pearson correlations between rubric scores and ground-truth metrics such as test-pass rates (a sketch of this check appears after these responses). We will also perform cross-judge validation by re-scoring a subset with an alternative judge model. Human evaluation of traces was not performed in the original work due to the resource constraints of expert review of complex terminal logs; we will note this as a limitation and suggest it as future work, while relying on the objective benchmarks for validation. Revision: yes.
- Referee: Experiments section: the evaluation lacks controls for benchmark overfitting or judge-specific bias, such as ablations replacing the rubric reward with outcome-based rewards or testing generalization on tasks outside the training distribution.
Authors: Our evaluation does include controls via testing on two distinct benchmarks (SWE-Bench Pro and internal C#) and varying main-agent setups to demonstrate robustness. However, we recognize the need for more explicit ablations on reward types and out-of-distribution testing. In the revised Experiments section, we will add results from an ablation where the rubric reward is replaced with pure outcome-based rewards (e.g., success/failure from test execution), showing comparative performance. Additionally, we will report generalization results on a set of non-SWE terminal tasks (e.g., system administration commands) to address potential overfitting to the training distribution. These additions will help isolate judge-specific effects. Revision: yes.
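A minimal sketch of the paired analysis promised in the first response, assuming per-task main-agent token counts are available for matched runs with and without the subagent. The data below is synthetic and the authors' exact statistical procedure is not specified.

```python
import random


def paired_bootstrap_ci(with_sub, no_sub, n_boot=10_000, alpha=0.05, seed=0):
    """95% bootstrap confidence interval for the mean per-task token reduction
    1 - with_sub[i] / no_sub[i], resampling tasks with replacement."""
    rng = random.Random(seed)
    reductions = [1.0 - w / n for w, n in zip(with_sub, no_sub)]
    boot_means = []
    for _ in range(n_boot):
        sample = [reductions[rng.randrange(len(reductions))] for _ in reductions]
        boot_means.append(sum(sample) / len(sample))
    boot_means.sort()
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(reductions) / len(reductions), (lo, hi)


# Synthetic per-task main-agent token counts, for illustration only.
no_sub = [52_000, 48_500, 61_200, 44_800, 57_300]
with_sub = [36_900, 33_400, 43_700, 30_900, 40_100]
mean_red, (lo, hi) = paired_bootstrap_ci(with_sub, no_sub)
print(f"mean reduction {mean_red:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```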
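Likewise, a sketch of the correlation and cross-judge checks promised in the second response, assuming each held-out trace carries a rubric score from the primary judge, a score from an alternative judge, and a binary test-pass outcome. All values shown are synthetic.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation; against a binary outcome this is the
    point-biserial correlation between rubric score and test passage."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic held-out traces: judge-A score, judge-B score, tests passed (0/1).
judge_a = [0.90, 0.40, 0.80, 0.20, 0.70, 0.95, 0.30]
judge_b = [0.85, 0.50, 0.75, 0.25, 0.60, 0.90, 0.35]
passed = [1, 0, 1, 0, 1, 1, 0]

print("rubric score vs. test outcome:", round(pearson(judge_a, passed), 3))
print("cross-judge agreement:        ", round(pearson(judge_a, judge_b), 3))
```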
Circularity Check
No significant circularity: purely empirical training and benchmark evaluation
full rationale
The paper presents Terminus-4B as the result of post-training Qwen3-4B via SFT followed by RL with a rubric-based LLM-as-judge reward, then reports direct empirical outcomes on external benchmarks (SWE-Bench Pro, internal SWE-Bench C#) and comparisons to frontier models. No derivations, first-principles predictions, equations, or fitted parameters are claimed or used; all headline metrics (token reduction, performance parity, subagent reliance) are measured outcomes rather than quantities defined by the training process itself. No self-citations form load-bearing premises, no uniqueness theorems are invoked, and no ansatz or renaming of known results occurs. The work is evaluated against external benchmarks, and none of its results reduce to their inputs by construction.
Lean theorems connected to this paper
- Cost.FunctionalEquation / Foundation.AlphaCoordinateFixation (washburn_uniqueness_aczel; alpha_pin_under_high_calibration): unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
Paper passage: "The final reward blends the scores for execution quality with final answer quality: r = (1 − α)(s̄_pos − s̄_pit) + α·s̄_fa ... We use an α = 0.5."
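Read literally, the quoted blend can be computed as below. Interpreting s̄_pos and s̄_pit as mean scores over positive and pitfall (penalty) rubric criteria is an assumption based on the notation; only the formula and α = 0.5 come from the excerpt.

```python
def blended_reward(s_pos: float, s_pit: float, s_fa: float, alpha: float = 0.5) -> float:
    """r = (1 - alpha) * (s_pos - s_pit) + alpha * s_fa, with alpha = 0.5 per the excerpt.
    s_pos: mean score on positive execution-quality criteria (assumed reading)
    s_pit: mean score on pitfall / penalty criteria (assumed reading)
    s_fa:  final-answer quality score"""
    return (1.0 - alpha) * (s_pos - s_pit) + alpha * s_fa


# Example: strong execution, few pitfalls, good final answer.
print(blended_reward(s_pos=0.9, s_pit=0.1, s_fa=0.8))  # -> 0.8
```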
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.