When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

Ethan Chandra; Fangru Lin; Lunyiu Nie; Stanislav Gannutin; Swarat Chaudhuri; Xu Yang

arxiv: 2606.00953 · v1 · pith:J2BCK4FMnew · submitted 2026-05-31 · 💻 cs.LG · cs.MA

When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

Xu Yang , Lunyiu Nie , Ethan Chandra , Stanislav Gannutin , Fangru Lin , Swarat Chaudhuri This is my paper

Pith reviewed 2026-06-28 17:52 UTC · model grok-4.3

classification 💻 cs.LG cs.MA

keywords multi-agent LLM systemscode generationgraph partitioningtask decompositiondependency analysissoftware engineeringagent orchestration

0 comments

The pith

Cohesion-aware partitioning of code dependency graphs lets multi-agent LLM coders raise pass rates while cutting time and cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats multi-agent coding as a graph partitioning task that trades off shorter critical paths against the cost of moving context between agents. It builds Co-Coder, which extracts a dependency graph via static analysis, isolates hub files, applies community detection to form cohesive partitions, and runs them with a dependency-aware scheduler. On 28 repository-level tasks the method improves pass rate, wall-clock time, and API cost over sequential, file-based, and other agent-team baselines, with the biggest gains on projects that have dense inter-file dependencies. The work shows that respecting code structure when assigning tasks to agents produces measurable efficiency gains rather than naive parallelism.

Core claim

Treating repository-level coding tasks as a graph partitioning problem on static dependency graphs, solved by isolating hubs and using community detection followed by dependency-aware scheduling, yields task assignments that improve pass rate by up to 14 percent, wall-clock speedup up to 2.1 times, and API cost reduction up to 35 percent compared with sequential and file-based parallel baselines.

What carries the argument

Static-analysis dependency graph partitioned by community detection after hub isolation, executed by a dependency-aware scheduler that assigns cohesive subgraphs to separate agents.

If this is right

Partitions that group files with many shared dependencies reduce the volume of context that must cross agent boundaries.
Community detection on the dependency graph produces more efficient task splits than assigning whole files or running everything sequentially.
The largest efficiency and quality gains appear on projects whose dependency graphs are densest.
Dependency-aware scheduling prevents agents from starting work on tasks whose prerequisites have not yet completed.
Reducing redundant context transfers directly lowers the number of tokens sent to the underlying LLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-partitioning logic could be applied to other multi-agent workflows that involve shared state, such as collaborative planning or distributed data analysis.
If the static graph underestimates certain runtime dependencies, an online refinement step that updates the partition during execution could recover some of the lost gains.
Replacing the fixed community-detection heuristic with a learned cost model that predicts actual LLM context-transfer expense might tighten the partitions further.

Load-bearing premise

Static analysis of the repository produces dependency graphs that accurately reflect the real communication costs agents incur when passing context to one another.

What would settle it

Measure whether the reported gains disappear on a high-dependency project when the static graph is replaced by one built from actual runtime call traces that include dynamic dependencies missed by static analysis.

Figures

Figures reproduced from arXiv: 2606.00953 by Ethan Chandra, Fangru Lin, Lunyiu Nie, Stanislav Gannutin, Swarat Chaudhuri, Xu Yang.

**Figure 1.** Figure 1: Average pass rate on DevEval and CodeProjectEval. Labels above bars: wallclock speedup vs. Sequential version; labels inside bars: avg. API cost per task (USD). Inspired by classic distributed computing systems, we propose Cohesion-aware Coder (Co-Coder), an orchestration framework that uses directed acyclic graphs to partition and schedule multi-agent systems. Co-Coder represents the project as a weight… view at source ↗

**Figure 2.** Figure 2: Overview of Co-Coder. The input specifications from the benchmarks are first parsed into [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Condensed RIB for the zxcvbn project. Three of six files shown; remaining files (feedback.py, time_estimates.py, __main__.py) omitted for space. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

read the original abstract

Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level software engineering and present Cohesion-aware Coder (Co-Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency-aware scheduler. Across 28 real-world tasks on DevEval and CodeProjectEval, Co-Coder advances the Pareto-frontier over sequential and file-based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall-clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency-dense projects. Co-coder demonstrates how cohesion-aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Co-Coder shows that community detection on static dependency graphs can improve multi-agent LLM coding results on dense repos, but the static-to-semantic cost mapping is the part that needs checking.

read the letter

The main takeaway is that formalizing multi-agent coding orchestration as graph partitioning on static dependency graphs, then applying community detection, produces measurable gains over sequential and simple parallel baselines on repository tasks.

What is new is the explicit mapping of task decomposition to a communication-to-computation trade-off solved via static analysis graphs, hub isolation, community detection, and dependency-aware scheduling. The Co-Coder system puts this into practice for repo-level software engineering.

The paper does well on the empirical side. It evaluates on 28 tasks from DevEval and CodeProjectEval, reports up to 14% higher pass rates, 2.1x wall-clock speedup, and 35% lower API cost versus sequential, file-based parallel, and Claude agent-team baselines, with the largest improvements on the most dependency-dense projects. The evaluation uses external benchmarks and published baselines, so the gains are not obviously circular.

The soft spot is the central modeling assumption. Static analysis edges (imports, calls) are used to weight the graph for context-transfer costs, but these syntactic relations may not match the semantic context that LLM agents actually exchange. If the two diverge, the partitions optimize the wrong objective. The paper would be stronger with direct evidence that the detected communities align with measured agent communication overhead rather than just syntactic structure.

A secondary check is whether the reported improvements are robust to different community detection algorithms or graph-construction choices, and whether statistical significance is shown for the 14%/2.1x/35% figures.

This paper is for researchers working on scaling multi-agent LLM systems to large codebases. Anyone interested in practical ways to reduce communication overhead in parallel coding agents will find the method and numbers useful. It has enough concrete technique and external-benchmark results to deserve a serious referee.

Referee Report

2 major / 2 minor

Summary. The paper claims that formalizing multi-agent LLM coding as a graph-partitioning problem—where static-analysis dependency graphs are partitioned via community detection to balance communication-to-computation costs—yields Co-Coder, which improves over sequential, file-based parallel, and Claude Code baselines. On 28 tasks from DevEval and CodeProjectEval it reports up to 14% higher pass rate, 2.1× wall-clock speedup, and 35% lower API cost, with largest gains on dependency-dense projects.

Significance. If the empirical results survive scrutiny of baseline fairness, statistical significance, and the validity of static graphs as proxies for LLM context-transfer costs, the work supplies a concrete, graph-theoretic design principle for orchestrating parallel LLM agents. The explicit modeling of the communication-to-computation trade-off and the use of community detection on real repository graphs are strengths that could generalize beyond coding.

major comments (2)

[Evaluation] Evaluation section: the reported 14.0% / 2.10× / 35% gains are presented as point estimates without reported standard deviations, number of runs, or statistical significance tests; this makes it impossible to judge whether the Pareto-frontier advance is robust or could be explained by run-to-run variance or post-hoc partition selection.
[Method] Method (graph construction and scheduler): the central claim that cohesion-aware partitioning improves the relevant trade-off rests on the untested assumption that static-analysis edges (imports, calls) accurately encode the context-transfer costs actually incurred by LLM agents; no ablation, correlation study, or human annotation validates that syntactic dependencies predict semantic context volume or latency.

minor comments (2)

[Abstract] Abstract and §1: the phrase "advances the Pareto-frontier" is used without defining the axes or showing the full frontier plot; a single sentence clarifying the three metrics and how the frontier is constructed would help.
[Throughout] Notation: the paper introduces "Cohesion-aware Coder (Co-Coder)" but later refers to "Co-coder"; consistent capitalization and acronym usage throughout would reduce minor confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on evaluation robustness and the grounding of our graph-partitioning assumptions. We respond to each major comment below and indicate planned revisions.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the reported 14.0% / 2.10× / 35% gains are presented as point estimates without reported standard deviations, number of runs, or statistical significance tests; this makes it impossible to judge whether the Pareto-frontier advance is robust or could be explained by run-to-run variance or post-hoc partition selection.

Authors: We acknowledge the limitation in the current presentation. The reported figures are single-run point estimates obtained under fixed random seeds for reproducibility. In the revised manuscript we will re-execute all 28 tasks with at least five independent runs per method, report means and standard deviations, and include paired statistical tests (Wilcoxon signed-rank) together with p-values. We will also document the deterministic partition-selection procedure to rule out post-hoc bias. revision: yes
Referee: [Method] Method (graph construction and scheduler): the central claim that cohesion-aware partitioning improves the relevant trade-off rests on the untested assumption that static-analysis edges (imports, calls) accurately encode the context-transfer costs actually incurred by LLM agents; no ablation, correlation study, or human annotation validates that syntactic dependencies predict semantic context volume or latency.

Authors: Static-analysis edges are a standard, inexpensive proxy in repository-level software engineering; our results already show that gains scale with dependency density, providing indirect empirical support. Nevertheless, we agree that an explicit ablation would strengthen the claim. The revision will add (i) a comparison of community-detection partitions against random and file-based baselines on the same graphs and (ii) a brief discussion of the proxy’s limitations. A dedicated human annotation or correlation study of context volume is beyond the scope of the present work and is noted as future research. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

full rationale

The paper formalizes orchestration as graph partitioning but reports results solely via empirical evaluation on external benchmarks (DevEval, CodeProjectEval) against published baselines. No equations, fitted parameters, or predictions reduce the reported pass-rate/speedup/cost gains to quantities defined inside the paper. No self-citation chains or ansatzes are load-bearing for the central claims. The derivation is self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that static dependency edges are a sufficient proxy for LLM context-transfer cost.

pith-pipeline@v0.9.1-grok · 5779 in / 1161 out tokens · 18532 ms · 2026-06-28T17:52:10.901081+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 9 canonical work pages · 4 internal anchors

[1]

C., Arun Iyer, Suresh Parthasarathy, Sriram K

Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. Codeplan: Repository-level coding using llms and planning.CoRR, abs/2309.12499, 2023

work page arXiv 2023
[2]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya G. Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail?CoRR, abs/2503.13657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Culler, Richard M

David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice E. Santos, Ramesh Subramonian, and Thorsten von Eicken. Logp: Towards a realistic model of parallel computation. In Marina C. Chen and Robert Halstead, editors,Proceedings of the Fourth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOP...

1993
[4]

TCP: a benchmark for temporal constraint-based planning

Zifeng Ding, Sikuan Yan, Moy Yuan, Xianglong Hu, Fangru Lin, and Andreas Vlachos. TCP: a benchmark for temporal constraint-based planning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9...

2025
[5]

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. Devbench: A realistic, developer-informed benchmark for code generation models.CoRR, abs/2601.11895, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

R. L. Graham. Bounds for certain multiprocessing anomalies.The Bell System Technical Journal, 45(9):1563–1581, 1966

1966
[7]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. InThe Twelfth International Conference on Learning Representations, ICL...

2024
[8]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[9]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[10]

Planning and scheduling in the process industry.OR Spectr., 24(3):219–250, 2002

Josef Kallrath. Planning and scheduling in the process industry.OR Spectr., 24(3):219–250, 2002

2002
[11]

A fast and high quality multilevel scheme for partitioning irregular graphs.SIAM J

George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs.SIAM J. Sci. Comput., 20(1):359–392, 1998

1998
[12]

Kernighan and Shen Lin

Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for partitioning graphs.Bell Syst. Tech. J., 49(2):291–307, 1970

1970
[13]

Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J

Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, Johann K. Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J. Ryan, Jiaxin Pei, and Diyi Yang. Cooperbench: Why coding agents cannot be your teammates yet.CoRR, abs/2601.13295, 2026

work page arXiv 2026
[14]

Mahoney, Kurt Keutzer, and Amir Gholami

Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. An LLM compiler for parallel function calling. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024,...

2024
[15]

Prompting large language models to tackle the full software development lifecycle: A case study

Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, and Kai Chen. Prompting large language models to tackle the full software development lifecycle: A case study. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eu...

2025
[16]

CAMEL: communicative agents for "mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: communicative agents for "mind" exploration of large language model society. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Infor- matio...

2023
[17]

Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories

Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li, Bin Gu, and Mengfei Yang. Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories. In Lun-Wei Ku, Andre Ma...

2024
[18]

Cohn, and Janet B

Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, and Janet B. Pierrehumbert. Graph-enhanced large language models in asynchronous plan reasoning. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference ...

2024
[19]

Tianyang Liu, Canwen Xu, and Julian J. McAuley. Repobench: Benchmarking repository- level code auto-completion systems. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024
[20]

Large language models miss the multi-agent mark

Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, and Michael J. Wooldridge. Large language models miss the multi-agent mark.CoRR, abs/2505.21298, 2025

work page arXiv 2025
[21]

Martin.Agile Software Development: Principles, Patterns, and Practices

R.C. Martin.Agile Software Development: Principles, Patterns, and Practices. Alan Apt series. Pearson Education, 2003

2003
[22]

Collins, Ilia Sucholutsky, Natalia Vélez, and Thomas L

Elizabeth Mieczkowski, Katherine M. Collins, Ilia Sucholutsky, Natalia Vélez, and Thomas L. Griffiths. Language model teams as distributed systems.CoRR, abs/2603.12229, 2026

work page arXiv 2026
[23]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Li...

2024
[24]

Scaling large language model- based multi-agent collaboration

Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large language model- based multi-agent collaboration. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025
[25]

Maps of random walks on complex networks reveal community structure.Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008

Martin Rosvall and Carl T Bergstrom. Maps of random walks on complex networks reveal community structure.Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008

2008
[26]

Silverman, Jason D

Alvin Wei Ming Tan, Chunhua Yu, Bria Long, Wanjing Ma, Tonya Murray, Rebecca D. Silverman, Jason D. Yeatman, and Michael C. Frank. Devbench: A multimodal developmental 11 benchmark for language learning. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Informatio...

2024
[27]

Performance-effective and low-complexity task scheduling for heterogeneous computing.IEEE Trans

Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing.IEEE Trans. Parallel Distributed Syst., 13(3):260– 274, 2002

2002
[28]

Leslie G. Valiant. A bridging model for parallel computation.Commun. ACM, 33(8):103–111, 1990

1990
[29]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, and et al. Openhands: An open platform for AI software developers as generalist agents. InThe...

2025
[30]

Repoformer: Selective retrieval for repository-level code completion

Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma. Repoformer: Selective retrieval for repository-level code completion. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024,...

2024
[31]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversation framework.CoRR, abs/2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.CoRR, abs/2407.01489, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Syst...

2024
[34]

Towards realistic project- level code generation via multi-agent collaboration and semantic architecture modeling.CoRR, abs/2511.03404, 2025

Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, and Yuanping Guo. Towards realistic project- level code generation via multi-agent collaboration and semantic architecture modeling.CoRR, abs/2511.03404, 2025

work page arXiv 2025
[35]

DevBench

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024...

2024

[1] [1]

C., Arun Iyer, Suresh Parthasarathy, Sriram K

Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D. C., Arun Iyer, Suresh Parthasarathy, Sriram K. Rajamani, Balasubramanyan Ashok, and Shashank Shet. Codeplan: Repository-level coding using llms and planning.CoRR, abs/2309.12499, 2023

work page arXiv 2023

[2] [2]

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya G. Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail?CoRR, abs/2503.13657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Culler, Richard M

David E. Culler, Richard M. Karp, David A. Patterson, Abhijit Sahay, Klaus E. Schauser, Eunice E. Santos, Ramesh Subramonian, and Thorsten von Eicken. Logp: Towards a realistic model of parallel computation. In Marina C. Chen and Robert Halstead, editors,Proceedings of the Fourth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOP...

1993

[4] [4]

TCP: a benchmark for temporal constraint-based planning

Zifeng Ding, Sikuan Yan, Moy Yuan, Xianglong Hu, Fangru Lin, and Andreas Vlachos. TCP: a benchmark for temporal constraint-based planning. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9...

2025

[5] [5]

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

Pareesa Ameneh Golnari, Adarsh Kumarappan, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, and Elsie Nallipogu. Devbench: A realistic, developer-informed benchmark for code generation models.CoRR, abs/2601.11895, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

R. L. Graham. Bounds for certain multiprocessing anomalies.The Bell System Technical Journal, 45(9):1563–1581, 1966

1966

[7] [7]

Metagpt: Meta programming for A multi-agent collaborative framework

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for A multi-agent collaborative framework. InThe Twelfth International Conference on Learning Representations, ICL...

2024

[8] [8]

Livecodebench: Holistic and contamination free evaluation of large language models for code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025

[9] [9]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. Swe-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024

[10] [10]

Planning and scheduling in the process industry.OR Spectr., 24(3):219–250, 2002

Josef Kallrath. Planning and scheduling in the process industry.OR Spectr., 24(3):219–250, 2002

2002

[11] [11]

A fast and high quality multilevel scheme for partitioning irregular graphs.SIAM J

George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs.SIAM J. Sci. Comput., 20(1):359–392, 1998

1998

[12] [12]

Kernighan and Shen Lin

Brian W. Kernighan and Shen Lin. An efficient heuristic procedure for partitioning graphs.Bell Syst. Tech. J., 49(2):291–307, 1970

1970

[13] [13]

Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J

Arpandeep Khatua, Hao Zhu, Peter Tran, Arya Prabhudesai, Frederic Sadrieh, Johann K. Lieberwirth, Xinkai Yu, Yicheng Fu, Michael J. Ryan, Jiaxin Pei, and Diyi Yang. Cooperbench: Why coding agents cannot be your teammates yet.CoRR, abs/2601.13295, 2026

work page arXiv 2026

[14] [14]

Mahoney, Kurt Keutzer, and Amir Gholami

Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W. Mahoney, Kurt Keutzer, and Amir Gholami. An LLM compiler for parallel function calling. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024,...

2024

[15] [15]

Prompting large language models to tackle the full software development lifecycle: A case study

Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, and Kai Chen. Prompting large language models to tackle the full software development lifecycle: A case study. In Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eu...

2025

[16] [16]

CAMEL: communicative agents for "mind" exploration of large language model society

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: communicative agents for "mind" exploration of large language model society. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Infor- matio...

2023

[17] [17]

Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories

Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li, Bin Gu, and Mengfei Yang. Deveval: A manually-annotated code generation benchmark aligned with real-world code repositories. In Lun-Wei Ku, Andre Ma...

2024

[18] [18]

Cohn, and Janet B

Fangru Lin, Emanuele La Malfa, Valentin Hofmann, Elle Michelle Yang, Anthony G. Cohn, and Janet B. Pierrehumbert. Graph-enhanced large language models in asynchronous plan reasoning. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference ...

2024

[19] [19]

Tianyang Liu, Canwen Xu, and Julian J. McAuley. Repobench: Benchmarking repository- level code auto-completion systems. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024

2024

[20] [20]

Large language models miss the multi-agent mark

Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, and Michael J. Wooldridge. Large language models miss the multi-agent mark.CoRR, abs/2505.21298, 2025

work page arXiv 2025

[21] [21]

Martin.Agile Software Development: Principles, Patterns, and Practices

R.C. Martin.Agile Software Development: Principles, Patterns, and Practices. Alan Apt series. Pearson Education, 2003

2003

[22] [22]

Collins, Ilia Sucholutsky, Natalia Vélez, and Thomas L

Elizabeth Mieczkowski, Katherine M. Collins, Ilia Sucholutsky, Natalia Vélez, and Thomas L. Griffiths. Language model teams as distributed systems.CoRR, abs/2603.12229, 2026

work page arXiv 2026

[23] [23]

Chatdev: Communicative agents for software development

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Li...

2024

[24] [24]

Scaling large language model- based multi-agent collaboration

Chen Qian, Zihao Xie, Yifei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling large language model- based multi-agent collaboration. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

2025

[25] [25]

Maps of random walks on complex networks reveal community structure.Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008

Martin Rosvall and Carl T Bergstrom. Maps of random walks on complex networks reveal community structure.Proceedings of the National Academy of Sciences, 105(4):1118–1123, 2008

2008

[26] [26]

Silverman, Jason D

Alvin Wei Ming Tan, Chunhua Yu, Bria Long, Wanjing Ma, Tonya Murray, Rebecca D. Silverman, Jason D. Yeatman, and Michael C. Frank. Devbench: A multimodal developmental 11 benchmark for language learning. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Informatio...

2024

[27] [27]

Performance-effective and low-complexity task scheduling for heterogeneous computing.IEEE Trans

Haluk Topcuoglu, Salim Hariri, and Min-You Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing.IEEE Trans. Parallel Distributed Syst., 13(3):260– 274, 2002

2002

[28] [28]

Leslie G. Valiant. A bridging model for parallel computation.Commun. ACM, 33(8):103–111, 1990

1990

[29] [29]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, and et al. Openhands: An open platform for AI software developers as generalist agents. InThe...

2025

[30] [30]

Repoformer: Selective retrieval for repository-level code completion

Di Wu, Wasi Uddin Ahmad, Dejiao Zhang, Murali Krishna Ramanathan, and Xiaofei Ma. Repoformer: Selective retrieval for repository-level code completion. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024,...

2024

[31] [31]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen LLM applications via multi-agent conversation framework.CoRR, abs/2308.08155, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Agentless: Demystifying LLM-based Software Engineering Agents

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents.CoRR, abs/2407.01489, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[33] [33]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang, editors,Advances in Neural Information Processing Syst...

2024

[34] [34]

Towards realistic project- level code generation via multi-agent collaboration and semantic architecture modeling.CoRR, abs/2511.03404, 2025

Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, and Yuanping Guo. Towards realistic project- level code generation via multi-agent collaboration and semantic architecture modeling.CoRR, abs/2511.03404, 2025

work page arXiv 2025

[35] [35]

DevBench

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Gptswarm: Language agents as optimizable graphs. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on Machine Learning, ICML 2024...

2024