Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution

Aliya Ahmad; Reyhaneh Jabbarvand; Yang Chen; Yiheng Zhou

arxiv: 2606.25514 · v1 · pith:K4TGCH3Bnew · submitted 2026-06-24 · 💻 cs.SE

Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution

Yang Chen , Aliya Ahmad , Yiheng Zhou , Reyhaneh Jabbarvand This is my paper

Pith reviewed 2026-06-25 20:12 UTC · model grok-4.3

classification 💻 cs.SE

keywords multi-agent scaffoldingsoftware issue resolutioncontext managementSWE-benchagentic workflowsbug fixingadaptive agentsdecentralized agents

0 comments

The pith

A decentralized multi-agent scaffold with event-based messaging and rubric-based branching outperforms baselines on SWE-bench using identical models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces icat-agent to address ambiguous software issues that require long-horizon navigation of codebases. It replaces shared context with synchronous event-based message passing to reduce degradation and poisoning. A rubric-based check classifies issue quality and routes well-defined cases to parallel patching while sending low-quality cases to preliminary exploration. Evaluations show consistent gains over agents such as SWE-agent and Claude Code on SWE-bench Verified and Pro, together with lower per-instance cost. The central demonstration is that improved scaffolding extracts substantially more resolutions from a fixed underlying model.

Core claim

icat-agent is a decentralized multi-agent scaffolding that replaces shared context with synchronous, event-based message passing and uses a rubric-based issue quality check to pivot its workflow: parallel patching and validation for well-defined issues, or preliminary exploration for low-quality ones. On SWE-bench Verified and SWE-bench Pro it outperforms baselines including SWE-agent, mini-SWE-agent, and Claude Code while using the same models, with gains of 3.6-8.4 percent and 6.3-18.5 percent respectively, and reduces average cost by $1.18 per instance. The same backbone resolves markedly more issues under icat-agent than under existing scaffolds, reaching 67.4 percent on SWE-bench Pro.

What carries the argument

icat-agent's decentralized scaffolding that substitutes shared context with synchronous event-based message passing and applies rubric-based quality classification to adapt between patching and exploration workflows.

If this is right

icat-agent raises resolution rates by 3.6-8.4% on SWE-bench Verified and 6.3-18.5% on SWE-bench Pro across all difficulty levels compared with baselines using identical models.
Average cost per instance drops by $1.18 relative to the multi-agent Claude Code baseline.
With GPT-5.4-xhigh the scaffold reaches 67.4% on SWE-bench Pro, 8.3 points above the prior best result.
The same model backbone solves substantially more issues under icat-agent than under existing scaffolds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Event-based messaging patterns may reduce context loss in multi-agent systems applied to domains beyond software repair, such as automated scientific hypothesis testing.
Varying the rubric criteria or adding learned quality predictors could refine branching decisions for specific issue distributions.
The observed cost reduction suggests the approach could scale to repositories larger than those in current benchmarks without linear growth in expense.

Load-bearing premise

The rubric-based issue quality check reliably classifies issues to select the correct workflow branch and synchronous event-based message passing prevents context degradation and poisoning more effectively than shared context.

What would settle it

Replace the rubric check with random branching or swap event-based messaging for shared context on the same SWE-bench instances and measure whether the reported performance margins over baselines disappear.

Figures

Figures reproduced from arXiv: 2606.25514 by Aliya Ahmad, Reyhaneh Jabbarvand, Yang Chen, Yiheng Zhou.

**Figure 1.** Figure 1: Example of (a) test overfitting (sympy-21596) and (b) patch overfitting (django-13964). (a) A weak test only exercises the single example mentioned in the issue, and the patch overfits to that narrow behavior and fails broader cases. (b) A wrong patch removes the BETWEEN optimization; the generated test checks the SQL pattern introduced by the patch rather than the actual query result. decomposes the prima… view at source ↗

**Figure 2.** Figure 2: Analysis of issue descriptions in SWE-bench Verified (all [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of icat-agent. We define the automated issue resolution task as ⟨𝐼𝐹 , 𝑅, 𝐵⟩. 𝐼𝐹 is the Issue Description with a quality score 𝐹 ∈ (0, 1]. A lower 𝐹 represents higher ambiguity (missing files, vague symptoms). 𝑅 and 𝐵 represent Repository State and Bug, respectively. Given 𝐼𝐹 , an agentic system generates Trajectory 𝐻 to resolve 𝐵, where the length of trajectory (in terms of tokens), |𝐻|, is pr… view at source ↗

**Figure 4.** Figure 4: A high-quality issue description from SWE-bench Pro [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: A low-quality issue description from SWE-bench Verified [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Explorer context summary for the low-quality issue in Figure 5. [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Example of information shared by Validator. [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 9.** Figure 9: Resolution rate across problem difficulty levels. Non-code files are excluded from the SWE-bench Pro [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 8.** Figure 8: Breakdown of success rate per different programming languages on [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 11.** Figure 11: Exclusive fixes across different techniques on (a, b) SWE-bench Verified and (c, d) SWE-bench Pro. [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗

**Figure 10.** Figure 10: Distribution of resolution outcomes across issue-quality [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

read the original abstract

Resolving issues with ambiguous and incomplete descriptions, particularly concerning complex bugs, requires a sophisticated, long-horizon workflow. Agents must navigate codebases to locate the root cause, reproduce the failure, implement a fix, and validate the resulting patch. Inefficient context management, thereby, can lead to rapid context degradation and context poisoning, preventing successful resolution. We propose icat-agent, a decentralized, multi-agent scaffolding that replaces shared context with synchronous, event-based message passing. Utilizing a rubric-based issue quality check, icat-agent strategically pivots its workflow: it initiates parallel patching and validation for well-defined issues, while deploying preliminary exploration for low-quality ones. A comprehensive evaluation of icat-agent on SWE-bench Verified and SWE-bench Pro demonstrates that it consistently outperforms prominent baselines across all difficulty levels, including SWE-agent, mini-SWE-agent, and Claude Code, while using the same underlying models, improving by 3.6-8.4% on SWE-bench Verified and 6.3-18.5% on SWE-bench Pro. icat-agent is also computationally efficient, reducing the average cost by $1.18 per instance compared with the multi-agent Claude Code baseline. Our findings reveal that a robust scaffold such as icat-agent unlocks substantial latent capability within a fixed model, with the same backbone resolving markedly more issues under icat-agent than under existing scaffolds. icat-agent +GPT-5.4-xhigh resolves 67.4% of SWE-bench Pro problems, outperforming the current best result on SWE-bench Pro (59.10%, mini-SWE-agent+GPT-5.4-xhigh) by 8.3 percentage points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

icat-agent shows concrete gains on SWE-bench via event messaging and rubric branching, but the rubric's accuracy is unverified so the attribution of the delta remains shaky.

read the letter

The paper's actual advance is the switch to synchronous event-based messaging in place of shared context, paired with a rubric that decides whether to run parallel patching or first do exploration. That combination produces the reported lifts: 3.6-8.4% on SWE-bench Verified and 6.3-18.5% on Pro, all with the same backbone models as the baselines, plus a claimed $1.18 lower average cost.

The work is straightforward engineering. It names the baselines (SWE-agent, mini-SWE-agent, Claude Code), keeps the model fixed, and measures both success rate and cost. Those are the parts that hold up on the given evidence.

The soft spot is exactly where the stress-test flagged it. The adaptive pivot depends on the rubric correctly separating well-defined issues from low-quality ones, yet the abstract supplies no precision, recall, agreement study, or error analysis for that classifier. Without that, you cannot tell whether the gains come from the branching logic or from simply running one of the branches more often. The abstract also omits run counts, variance, or any check against benchmark leakage.

If the full manuscript contains ablations that isolate the rubric and the messaging change, the central claim strengthens. As written, the performance numbers are presented but the mechanism that supposedly produces them is not yet pinned down.

This is for people building multi-agent coding systems who want a worked example of decentralized messaging and workflow branching. It is solid enough on the engineering side to go to a referee, though the methods and results sections will need direct examination on the rubric validation and statistical reporting.

Referee Report

3 major / 2 minor

Summary. The paper introduces icat-agent, a decentralized multi-agent scaffolding for software issue resolution that replaces shared context with synchronous event-based message passing and uses a rubric-based issue quality classifier to adaptively branch: parallel patching/validation for well-defined issues versus preliminary exploration for low-quality ones. It reports consistent outperformance over SWE-agent, mini-SWE-agent, and Claude Code (same underlying models) on SWE-bench Verified (+3.6–8.4%) and SWE-bench Pro (+6.3–18.5%), plus cost reduction ($1.18/instance vs. Claude Code) and a new SOTA of 67.4% on Pro with GPT-5.4-xhigh.

Significance. If the performance deltas are robustly attributable to the adaptive scaffolding rather than unablated factors, the result would indicate that carefully designed multi-agent workflows can unlock substantial latent capability in fixed LLMs on long-horizon software engineering tasks, with direct implications for agentic systems in SE.

major comments (3)

[§3.2] §3.2 (Rubric-based Issue Quality Check): The manuscript defines the rubric and the pivot logic but reports no quantitative validation—no accuracy/precision/recall on a labeled SWE-bench subset, no inter-annotator agreement, and no error analysis of misclassifications. Because the claimed gains rest on the classifier correctly triggering the appropriate branch, this omission prevents attribution of the 3.6–18.5% improvements to the adaptive mechanism.
[§5] §5 (Experimental Evaluation): The results section presents aggregate percentage improvements and a cost comparison but supplies no statistical significance tests, confidence intervals, per-difficulty breakdowns with error bars, or ablation studies isolating the contribution of event-based messaging versus the quality-check pivot versus other design choices.
[§4.1] §4.1 (Message Passing Design): The claim that synchronous event-based passing “sufficiently prevents context degradation and poisoning” is asserted without supporting measurements (e.g., context-length traces, poisoning incident rates, or comparison against shared-context baselines under identical model settings).

minor comments (2)

[Table 1] Table 1 and Figure 3: axis labels and legend entries use inconsistent model shorthand (e.g., “GPT-5.4-xhigh” vs. “Claude-3.5”) that should be standardized for readability.
[§2] §2 (Related Work): The discussion of prior multi-agent scaffolds omits recent SWE-bench-specific ablations on context management; adding 2–3 targeted citations would strengthen positioning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each of the major comments below, indicating the revisions we plan to make to strengthen the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Rubric-based Issue Quality Check): The manuscript defines the rubric and the pivot logic but reports no quantitative validation—no accuracy/precision/recall on a labeled SWE-bench subset, no inter-annotator agreement, and no error analysis of misclassifications. Because the claimed gains rest on the classifier correctly triggering the appropriate branch, this omission prevents attribution of the 3.6–18.5% improvements to the adaptive mechanism.

Authors: We agree that quantitative validation is necessary to robustly attribute the gains to the adaptive mechanism. In the revised manuscript, we will add a dedicated evaluation of the issue quality classifier on a labeled subset of SWE-bench, including accuracy, precision, recall, inter-annotator agreement metrics, and error analysis of any misclassifications. revision: yes
Referee: [§5] §5 (Experimental Evaluation): The results section presents aggregate percentage improvements and a cost comparison but supplies no statistical significance tests, confidence intervals, per-difficulty breakdowns with error bars, or ablation studies isolating the contribution of event-based messaging versus the quality-check pivot versus other design choices.

Authors: The current manuscript focuses on aggregate results, but we acknowledge the value of additional statistical rigor and component analysis. We will revise §5 to include statistical significance tests, confidence intervals, per-difficulty breakdowns with error bars, and ablation studies for the key design choices including event-based messaging and the quality-check pivot. revision: yes
Referee: [§4.1] §4.1 (Message Passing Design): The claim that synchronous event-based passing “sufficiently prevents context degradation and poisoning” is asserted without supporting measurements (e.g., context-length traces, poisoning incident rates, or comparison against shared-context baselines under identical model settings).

Authors: We will strengthen this section by adding supporting measurements such as context-length traces during execution, observed poisoning incident rates, and more detailed comparisons to shared-context approaches under matched conditions. This will be included in the revised version to better substantiate the design choice. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical benchmark comparisons

full rationale

The manuscript advances an empirical claim that icat-agent outperforms named baselines (SWE-agent, mini-SWE-agent, Claude Code) on SWE-bench Verified and Pro by measured percentages, using identical underlying models. No derivation chain, equations, fitted parameters renamed as predictions, or self-citations are invoked to support the performance deltas; the results are presented as outcomes of controlled experiments. The rubric-based classifier is described as a design choice whose accuracy is not quantified in the provided text, but this is an evidence gap rather than a circular reduction. The central result therefore remains self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, fitted constants, or postulated entities; the contribution is an empirical engineering system whose internal parameters (e.g., rubric thresholds) are not disclosed.

pith-pipeline@v0.9.1-grok · 5851 in / 1121 out tokens · 31518 ms · 2026-06-25T20:12:41.469214+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 2 canonical work pages

[1]

https://tree-sitter.github.io/tree-sitter/

Tree-sitter. https://tree-sitter.github.io/tree-sitter/
[2]

Tdd-bench verified: Can llms generate tests for issues before they get resolved?arXiv preprint arXiv:2412.02883, 2024

Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. Tdd-bench verified: Can llms generate tests for issues before they get resolved?arXiv preprint arXiv:2412.02883, 2024

arXiv 2024
[3]

Otter: Generating tests from issues to validate swe patches.ICML, 2025

Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. Otter: Generating tests from issues to validate swe patches.ICML, 2025

2025
[4]

Claude Sonnet 4.5 system card

Anthropic. Claude Sonnet 4.5 system card. Technical report, Anthropic, September 2025. URL https://www.anthropic. com/claude-sonnet-4-5-system-card

2025
[5]

Anthropic agent teams

Anthropic. Anthropic agent teams. https://code.claude.com/docs/en/agent-teams, 2026

2026
[6]

MASAI: Modular architecture for software-engineering AI agents

Daman Arora, Atharv Sonwane, Nalin Wadhwa, Abhav Mehrotra, Saiteja Utpala, Ramakrishna Bairi, Aditya Kanade, and Nagarajan Natarajan. MASAI: Modular architecture for software-engineering AI agents. InNeurIPS 2024 Workshop on Open-World Agents (OW A), 2024

2024
[7]

URL https://github.com/Intelligent-CAT-Lab/icat-agent

Artifact, 2026. URL https://github.com/Intelligent-CAT-Lab/icat-agent

2026
[8]

Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei An- driushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URL https://arxiv.org/abs/2505.20411

arXiv 2025
[9]

CodeR: Issue resolving with multi-agent and task graphs.arXiv preprint arXiv:2406.01304, 2024

Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, et al. CodeR: Issue resolving with multi-agent and task graphs.arXiv preprint arXiv:2406.01304, 2024

arXiv 2024
[10]

Can old tests do new tricks for resolving swe issues?, 2026

Yang Chen, Toufique Ahmed, Reyhaneh Jabbarvand, and Martin Hirzel. Can old tests do new tricks for resolving swe issues?, 2026

2026
[11]

Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified. OpenAI, 2024. URL https://openai.com/index/introducing-swe-bench- verified/

2024
[12]

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, , Vol. 1, No. 1, Article . Publication date: June 2026. Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Eff...

Pith/arXiv arXiv 2026
[13]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, 2024

2024
[14]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024
[15]

Langgraph

LangChain. Langgraph. https://github.com/langchain-ai/langgraph, 2026

2026
[16]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12: 157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12: 157–173, 2024

2024
[17]

https://www.minimax.io/news/minimax-m25, 2026

MiniMax. https://www.minimax.io/news/minimax-m25, 2026

2026
[18]

Swt-bench: Testing and validating real-world bug-fixes with code agents.Advances in Neural Information Processing Systems, 37:81857–81887, 2024

Niels Mündler, Mark N Müller, Jingxuan He, and Martin Vechev. Swt-bench: Testing and validating real-world bug-fixes with code agents.Advances in Neural Information Processing Systems, 37:81857–81887, 2024

2024
[19]

GPT-5 system card

OpenAI. GPT-5 system card. https://openai.com/index/gpt-5-system-card/, 2025. Accessed: 2026-05-01

2025
[20]

GPT-5.4 system card

OpenAI. GPT-5.4 system card. https://openai.com/index/introducing-gpt-5-4/, 2025

2025
[21]

Nguyen, and Nghi D

Huy Nhat Phan, Phong X. Nguyen, and Nghi D. Q. Bui. HyperAgent: Generalist software engineering agents to solve coding tasks at scale.arXiv preprint arXiv:2409.16299, 2024

arXiv 2024
[22]

(2024) Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186. As...

work page doi:10.18653/v1/2024.acl-long.810 2024
[23]

Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025

Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025. URL https://arxiv.org/abs/2504.08703

arXiv 2025
[24]

Mini-SWE-Agent: A minimal agent scaffold for software engineering

SWE-Agent Team. Mini-SWE-Agent: A minimal agent scaffold for software engineering. https://github.com/SWE- agent/mini-swe-agent, 2025

2025
[25]

Context rot: Why ai gets worse the longer you chat

Teresa Torres. Context rot: Why ai gets worse the longer you chat. https://www.producttalk.org/context-rot/, 2026

2026
[26]

Ambig-swe: Interactive agents to overcome underspecificity in software engineering

Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, and Graham Neubig. Ambig-swe: Interactive agents to overcome underspecificity in software engineering. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[27]

How we broke top ai agent benchmarks: And what comes next

Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, and Dawn Song. How we broke top ai agent benchmarks: And what comes next. https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont, 2026

2026
[28]

Aegis: An agent-based framework for bug reproduction from issue descriptions

Xinchen Wang, Pengfei Gao, Xiangxin Meng, Chao Peng, Ruida Hu, Yun Lin, and Cuiyun Gao. Aegis: An agent-based framework for bug reproduction from issue descriptions. InProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, pages 331–342, 2025

2025
[29]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. InInternational Conference on Learning Representations (ICLR), 2025

2025
[30]

The long-horizon task mirage? diagnosing where and why agentic systems break

Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, and Robert D Nowak. The long-horizon task mirage? diagnosing where and why agentic systems break. arXiv preprint arXiv:2604.11978, 2026

Pith/arXiv arXiv 2026
[31]

Live-swe-agent: Can software engineering agents self-evolve on the fly?, 2025

Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. Live-swe-agent: Can software engineering agents self-evolve on the fly?, 2025. URL https://arxiv.org/abs/2511.13646

arXiv 2025
[32]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-Agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

2024
[33]

Jimenez, Alex L

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024. URL https://arxiv.org/abs/2410.03859

arXiv 2024
[34]

Multi-swe-bench: A multilingual benchmark for issue resolving, 2025

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving, 2025. URL https://arxiv.org/abs/2504.02605. , Vol. 1, No. 1, ...

Pith/arXiv arXiv 2025
[35]

Swe-bench goes live!Neurips 2025, 2025

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live!Neurips 2025, 2025

2025
[36]

Swe-bench goes live!, 2025

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live!, 2025. URL https://arxiv.org/abs/2505.23419

arXiv 2025
[37]

AutoCodeRover: Autonomous program im- provement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program im- provement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA),
[38]

doi: 10.1145/3650212.3680384. , Vol. 1, No. 1, Article . Publication date: June 2026. Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution 21 A Details of Implementations A.1 Details of Tools In this section, we discuss the details of the tools designed and used in the icat-agent. icat-agent implements a set o...

work page doi:10.1145/3650212.3680384 2026
[39]

Do not only inspect the function mentioned in the plan

Trace the full code path before editing. Do not only inspect the function mentioned in the plan. - Use trace_call_chain() to find callers and callees. - Read the metaclass, factory, base class, or option parser if involved. - Understand how the value flows from definition to processing to usage
[40]

Do not edit test files

Modify source code only. Do not edit test files
[41]

Think about edge cases and edit multiple files if needed
[42]

Ensure each edit affects only the intended code region
[43]

patch_generated

When the patch is ready, call: share_findings("patch_generated", "<description>")
[44]

## Inter-agent Communication - Reproducer findings provide reproduction behavior and validation results

If validation fails, revise the patch and share a new patch_generated finding. ## Inter-agent Communication - Reproducer findings provide reproduction behavior and validation results. - If validation fails, reflect on the feedback and revise the patch. - Do not declare completion until the reproducer confirms that all tests pass. - Carefully decide what t...

2026
[45]

- Search broadly for tests in the affected module or package

Identify and run existing regression tests related to the issue. - Search broadly for tests in the affected module or package. - Run the affected package's test suite, not only individual tests. - Register regression tests and run them before any fix. - Share baseline results with the other agents
[46]

bug_confirmed

Write and run a comprehensive reproduction script. - Use existing tests to learn setup patterns. - Test each scenario separately. - Each scenario must have its own assertion and failure message. - Avoid combining configurations in one test, because this can mask bugs. - When the bug is reproduced, call: share_findings("bug_confirmed", "<details>")
[47]

Wait for the patch editor's fix and call apply_patch() to apply it
[48]

Thoroughly validate the patch. a. First check that the patched code compiles or passes a smoke test. b. Re-run the reproduction script. c. Run all registered regression tests. d. Run additional related tests for modules touched by the patch. e. Add edge-case tests for scenarios mentioned in the issue description. f. If the issue mentions multiple scenario...
[49]

validation_passed

Share validation results. - If the bug is fixed and all tests pass, call: share_findings("validation_passed", "<summary>") - If any test fails, any error occurs, or the bug remains, call: share_findings("validation_failed", "<specific failure details>") Then call apply_patch() again to wait for the revised patch. - IMPORTANT: After a validation failure, R...

2026
[50]

Choices",

Localization hints Determine whether the issue explicitly or implicitly mentions buggy files, classes, functions, methods, stack traces, modules, or code locations. **buggy_files**: List of FULL file paths mentioned or strongly implied **buggy_classes**: List of classes mentioned or implied in the issue (e.g. "Choices", "IntegerChoices"). **buggy_function...
[51]

Repair strategy Determine whether the issue describes a fix strategy, expected code change, or implementation tip
[52]

quality":

Reproduction hints Determine whether the issue includes reproduction steps, input examples, failing commands, expected behavior, actual behavior, stack traces, or test cases. Classify the issue quality as one of: - high: clear localization, repair, and reproduction information; - low: partial information, but additional repository exploration is needed; R...

2026

[1] [1]

https://tree-sitter.github.io/tree-sitter/

Tree-sitter. https://tree-sitter.github.io/tree-sitter/

[2] [2]

Tdd-bench verified: Can llms generate tests for issues before they get resolved?arXiv preprint arXiv:2412.02883, 2024

Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, and Saurabh Sinha. Tdd-bench verified: Can llms generate tests for issues before they get resolved?arXiv preprint arXiv:2412.02883, 2024

arXiv 2024

[3] [3]

Otter: Generating tests from issues to validate swe patches.ICML, 2025

Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, and Martin Hirzel. Otter: Generating tests from issues to validate swe patches.ICML, 2025

2025

[4] [4]

Claude Sonnet 4.5 system card

Anthropic. Claude Sonnet 4.5 system card. Technical report, Anthropic, September 2025. URL https://www.anthropic. com/claude-sonnet-4-5-system-card

2025

[5] [5]

Anthropic agent teams

Anthropic. Anthropic agent teams. https://code.claude.com/docs/en/agent-teams, 2026

2026

[6] [6]

MASAI: Modular architecture for software-engineering AI agents

Daman Arora, Atharv Sonwane, Nalin Wadhwa, Abhav Mehrotra, Saiteja Utpala, Ramakrishna Bairi, Aditya Kanade, and Nagarajan Natarajan. MASAI: Modular architecture for software-engineering AI agents. InNeurIPS 2024 Workshop on Open-World Agents (OW A), 2024

2024

[7] [7]

URL https://github.com/Intelligent-CAT-Lab/icat-agent

Artifact, 2026. URL https://github.com/Intelligent-CAT-Lab/icat-agent

2026

[8] [8]

Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025

Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei An- driushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel. Swe-rebench: An automated pipeline for task collection and decontaminated evaluation of software engineering agents, 2025. URL https://arxiv.org/abs/2505.20411

arXiv 2025

[9] [9]

CodeR: Issue resolving with multi-agent and task graphs.arXiv preprint arXiv:2406.01304, 2024

Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, et al. CodeR: Issue resolving with multi-agent and task graphs.arXiv preprint arXiv:2406.01304, 2024

arXiv 2024

[10] [10]

Can old tests do new tricks for resolving swe issues?, 2026

Yang Chen, Toufique Ahmed, Reyhaneh Jabbarvand, and Martin Hirzel. Can old tests do new tricks for resolving swe issues?, 2026

2026

[11] [11]

Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified. OpenAI, 2024. URL https://openai.com/index/introducing-swe-bench- verified/

2024

[12] [12]

Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, , Vol. 1, No. 1, Article . Publication date: June 2026. Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Eff...

Pith/arXiv arXiv 2026

[13] [13]

Metagpt: Meta programming for a multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework. InInternational Conference on Learning Representations, 2024

2024

[14] [14]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations (ICLR), 2024

2024

[15] [15]

Langgraph

LangChain. Langgraph. https://github.com/langchain-ai/langgraph, 2026

2026

[16] [16]

Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12: 157–173, 2024

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the association for computational linguistics, 12: 157–173, 2024

2024

[17] [17]

https://www.minimax.io/news/minimax-m25, 2026

MiniMax. https://www.minimax.io/news/minimax-m25, 2026

2026

[18] [18]

Swt-bench: Testing and validating real-world bug-fixes with code agents.Advances in Neural Information Processing Systems, 37:81857–81887, 2024

Niels Mündler, Mark N Müller, Jingxuan He, and Martin Vechev. Swt-bench: Testing and validating real-world bug-fixes with code agents.Advances in Neural Information Processing Systems, 37:81857–81887, 2024

2024

[19] [19]

GPT-5 system card

OpenAI. GPT-5 system card. https://openai.com/index/gpt-5-system-card/, 2025. Accessed: 2026-05-01

2025

[20] [20]

GPT-5.4 system card

OpenAI. GPT-5.4 system card. https://openai.com/index/introducing-gpt-5-4/, 2025

2025

[21] [21]

Nguyen, and Nghi D

Huy Nhat Phan, Phong X. Nguyen, and Nghi D. Q. Bui. HyperAgent: Generalist software engineering agents to solve coding tasks at scale.arXiv preprint arXiv:2409.16299, 2024

arXiv 2024

[22] [22]

(2024) Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186. As...

work page doi:10.18653/v1/2024.acl-long.810 2024

[23] [23]

Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025

Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, and Laurent Callot. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents, 2025. URL https://arxiv.org/abs/2504.08703

arXiv 2025

[24] [24]

Mini-SWE-Agent: A minimal agent scaffold for software engineering

SWE-Agent Team. Mini-SWE-Agent: A minimal agent scaffold for software engineering. https://github.com/SWE- agent/mini-swe-agent, 2025

2025

[25] [25]

Context rot: Why ai gets worse the longer you chat

Teresa Torres. Context rot: Why ai gets worse the longer you chat. https://www.producttalk.org/context-rot/, 2026

2026

[26] [26]

Ambig-swe: Interactive agents to overcome underspecificity in software engineering

Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, and Graham Neubig. Ambig-swe: Interactive agents to overcome underspecificity in software engineering. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[27] [27]

How we broke top ai agent benchmarks: And what comes next

Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, and Dawn Song. How we broke top ai agent benchmarks: And what comes next. https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont, 2026

2026

[28] [28]

Aegis: An agent-based framework for bug reproduction from issue descriptions

Xinchen Wang, Pengfei Gao, Xiangxin Meng, Chao Peng, Ruida Hu, Yun Lin, and Cuiyun Gao. Aegis: An agent-based framework for bug reproduction from issue descriptions. InProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, pages 331–342, 2025

2025

[29] [29]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. InInternational Conference on Learning Representations (ICLR), 2025

2025

[30] [30]

The long-horizon task mirage? diagnosing where and why agentic systems break

Xinyu Jessica Wang, Haoyue Bai, Yiyou Sun, Haorui Wang, Shuibai Zhang, Wenjie Hu, Mya Schroder, Bilge Mutlu, Dawn Song, and Robert D Nowak. The long-horizon task mirage? diagnosing where and why agentic systems break. arXiv preprint arXiv:2604.11978, 2026

Pith/arXiv arXiv 2026

[31] [31]

Live-swe-agent: Can software engineering agents self-evolve on the fly?, 2025

Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang. Live-swe-agent: Can software engineering agents self-evolve on the fly?, 2025. URL https://arxiv.org/abs/2511.13646

arXiv 2025

[32] [32]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-Agent: Agent-computer interfaces enable automated software engineering.Advances in Neural Information Processing Systems, 37:50528–50652, 2024

2024

[33] [33]

Jimenez, Alex L

John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do ai systems generalize to visual software domains?, 2024. URL https://arxiv.org/abs/2410.03859

arXiv 2024

[34] [34]

Multi-swe-bench: A multilingual benchmark for issue resolving, 2025

Daoguang Zan, Zhirong Huang, Wei Liu, Hanwu Chen, Linhao Zhang, Shulin Xin, Lu Chen, Qi Liu, Xiaojian Zhong, Aoyan Li, Siyao Liu, Yongsheng Xiao, Liangqiang Chen, Yuyu Zhang, Jing Su, Tianyu Liu, Rui Long, Kai Shen, and Liang Xiang. Multi-swe-bench: A multilingual benchmark for issue resolving, 2025. URL https://arxiv.org/abs/2504.02605. , Vol. 1, No. 1, ...

Pith/arXiv arXiv 2025

[35] [35]

Swe-bench goes live!Neurips 2025, 2025

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live!Neurips 2025, 2025

2025

[36] [36]

Swe-bench goes live!, 2025

Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, and Dongmei Zhang. Swe-bench goes live!, 2025. URL https://arxiv.org/abs/2505.23419

arXiv 2025

[37] [37]

AutoCodeRover: Autonomous program im- provement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous program im- provement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA),

[38] [38]

doi: 10.1145/3650212.3680384. , Vol. 1, No. 1, Article . Publication date: June 2026. Unlocking Model Potentials Through Adaptive Multi-Agent Scaffolding for Efficient Issue Resolution 21 A Details of Implementations A.1 Details of Tools In this section, we discuss the details of the tools designed and used in the icat-agent. icat-agent implements a set o...

work page doi:10.1145/3650212.3680384 2026

[39] [39]

Do not only inspect the function mentioned in the plan

Trace the full code path before editing. Do not only inspect the function mentioned in the plan. - Use trace_call_chain() to find callers and callees. - Read the metaclass, factory, base class, or option parser if involved. - Understand how the value flows from definition to processing to usage

[40] [40]

Do not edit test files

Modify source code only. Do not edit test files

[41] [41]

Think about edge cases and edit multiple files if needed

[42] [42]

Ensure each edit affects only the intended code region

[43] [43]

patch_generated

When the patch is ready, call: share_findings("patch_generated", "<description>")

[44] [44]

## Inter-agent Communication - Reproducer findings provide reproduction behavior and validation results

If validation fails, revise the patch and share a new patch_generated finding. ## Inter-agent Communication - Reproducer findings provide reproduction behavior and validation results. - If validation fails, reflect on the feedback and revise the patch. - Do not declare completion until the reproducer confirms that all tests pass. - Carefully decide what t...

2026

[45] [45]

- Search broadly for tests in the affected module or package

Identify and run existing regression tests related to the issue. - Search broadly for tests in the affected module or package. - Run the affected package's test suite, not only individual tests. - Register regression tests and run them before any fix. - Share baseline results with the other agents

[46] [46]

bug_confirmed

Write and run a comprehensive reproduction script. - Use existing tests to learn setup patterns. - Test each scenario separately. - Each scenario must have its own assertion and failure message. - Avoid combining configurations in one test, because this can mask bugs. - When the bug is reproduced, call: share_findings("bug_confirmed", "<details>")

[47] [47]

Wait for the patch editor's fix and call apply_patch() to apply it

[48] [48]

Thoroughly validate the patch. a. First check that the patched code compiles or passes a smoke test. b. Re-run the reproduction script. c. Run all registered regression tests. d. Run additional related tests for modules touched by the patch. e. Add edge-case tests for scenarios mentioned in the issue description. f. If the issue mentions multiple scenario...

[49] [49]

validation_passed

Share validation results. - If the bug is fixed and all tests pass, call: share_findings("validation_passed", "<summary>") - If any test fails, any error occurs, or the bug remains, call: share_findings("validation_failed", "<specific failure details>") Then call apply_patch() again to wait for the revised patch. - IMPORTANT: After a validation failure, R...

2026

[50] [50]

Choices",

Localization hints Determine whether the issue explicitly or implicitly mentions buggy files, classes, functions, methods, stack traces, modules, or code locations. **buggy_files**: List of FULL file paths mentioned or strongly implied **buggy_classes**: List of classes mentioned or implied in the issue (e.g. "Choices", "IntegerChoices"). **buggy_function...

[51] [51]

Repair strategy Determine whether the issue describes a fix strategy, expected code change, or implementation tip

[52] [52]

quality":

Reproduction hints Determine whether the issue includes reproduction steps, input examples, failing commands, expected behavior, actual behavior, stack traces, or test cases. Classify the issue quality as one of: - high: clear localization, repair, and reproduction information; - low: partial information, but additional repository exploration is needed; R...

2026