LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies

Nagarjuna Kanamarlapudi; Praveen K

arxiv: 2606.01490 · v1 · pith:AWXCCW6Qnew · submitted 2026-05-31 · 💻 cs.SE · cs.AI · cs.MA

Pith reviewed 2026-06-28 16:11 UTC · model grok-4.3

keywords: multi-agent LLM collaboration · software architecture design · adversarial topologies · automated design evaluation · factorial experiment · cross-model review · design quality rubric

The pith

A structural adversarial multi-agent topology produces the highest-rated software designs among twelve LLM collaboration structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper runs a controlled test of twelve ways for multiple large language models to collaborate on software architecture design tasks. It uses a factorial design across hundreds of runs and finds that one adversarial structure, where agents demand full rewrites instead of small patches, scores highest when judged by three separate automated evaluators. A cross-model review approach, with one model generating and another reviewing, comes in second place consistently. Parallel merging of agent outputs ranks lowest because of context limits and mismatched design pieces. The results show that the specific organization of agents affects the quality of the final designs in measurable ways.

Core claim

The paper establishes that in a 2×2×2 factorial experiment with 520 runs on eight design tasks, the structural adversarial topology v4b achieves the top weighted ensemble score of 4.637/5.0 on a 12-dimensional rubric, ahead of cross-model review at 4.606, while all three evaluators place parallel merge variants in the bottom tier at 3.65-3.79 due to token starvation and the Frankenstein effect.

What carries the argument

Structural adversarial topology v4b, a prompt-engineered multi-agent structure that requires complete rewrites rather than incremental patches during software design refinement.
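As a rough illustration of what a rewrite mandate could mean operationally, the sketch below never hands the previous draft back to the designer: each round regenerates the whole design from the task and the reviewer's critique. The function names, the acceptance flag, the round limit, and the choice to withhold the earlier draft are all assumptions made for illustration; the paper's actual prompts and stopping rule are not reproduced here.

```python
from typing import Callable, List, Tuple

# Hypothetical agent signatures; in a real system these would wrap LLM calls.
Designer = Callable[[str, str], str]               # (task, critique) -> full design
Reviewer = Callable[[str, str], Tuple[bool, str]]  # (task, design) -> (accepted, critique)


def refine_with_rewrites(
    task: str, designer: Designer, reviewer: Reviewer, max_rounds: int = 3
) -> Tuple[str, List[str]]:
    """Run a review loop in which every rejection triggers a full rewrite."""
    design = designer(task, "")
    history = [design]
    for _ in range(max_rounds):
        accepted, critique = reviewer(task, design)
        if accepted:
            break
        # Rewrite mandate (assumed form): regenerate from task + critique only,
        # rather than patching the rejected draft.
        design = designer(task, critique)
        history.append(design)
    return design, history


# Stub agents so the sketch runs without any model access.
def stub_designer(task: str, critique: str) -> str:
    suffix = f" addressing: {critique}" if critique else ""
    return f"design for {task}{suffix}"


def stub_reviewer(task: str, design: str) -> Tuple[bool, str]:
    if "addressing" in design:
        return True, ""
    return False, "no failure-mode analysis"


final_design, drafts = refine_with_rewrites("url shortener", stub_designer, stub_reviewer)
print(len(drafts), final_design)
```

A patch-based variant would instead pass the rejected draft to the designer and ask for targeted changes; the two differ only in what the designer is given on each round.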

If this is right

Structural adversarial topologies with rewrite mandates improve design quality over cooperative or merge-based structures.
Cross-model review, using separate models for generation and review, delivers consistently high performance across evaluators.
Parallel merge topologies produce lower-quality outputs due to token limits and inconsistent combined designs.
Different evaluator models agree on the best and worst topologies but can disagree on middle-ranked ones.
A weighted ensemble of multiple evaluator models supplies stable rankings across repeated trials.
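
One way to check the agreement pattern described above is to rank the variants separately for each evaluator and compare the extremes. The sketch below does that on placeholder numbers; the variant labels `A` to `D`, the evaluator keys, and all scores are invented for illustration and are not the paper's data.

```python
from typing import Dict, List, Tuple

Scores = Dict[str, Dict[str, float]]  # evaluator -> variant -> mean score


def rank_variants(mean_scores: Dict[str, float]) -> List[str]:
    """Order variant names from highest to lowest mean score."""
    return sorted(mean_scores, key=lambda v: mean_scores[v], reverse=True)


def extremes_agree(per_evaluator: Scores) -> Tuple[bool, bool]:
    """Return (same top variant?, same bottom variant?) across evaluators."""
    rankings = [rank_variants(s) for s in per_evaluator.values()]
    same_top = len({r[0] for r in rankings}) == 1
    same_bottom = len({r[-1] for r in rankings}) == 1
    return same_top, same_bottom


def contested_variants(per_evaluator: Scores) -> List[str]:
    """Variants whose rank position differs between at least two evaluators."""
    rankings = [rank_variants(s) for s in per_evaluator.values()]
    return sorted(v for v in rankings[0] if len({r.index(v) for r in rankings}) > 1)


# Placeholder scores only (NOT from the paper).
PLACEHOLDER_SCORES: Scores = {
    "evaluator_1": {"A": 4.6, "B": 4.2, "C": 4.0, "D": 3.7},
    "evaluator_2": {"A": 4.5, "B": 3.9, "C": 4.1, "D": 3.6},
    "evaluator_3": {"A": 4.7, "B": 4.3, "C": 4.1, "D": 3.8},
}

print(extremes_agree(PLACEHOLDER_SCORES))      # agreement at top and bottom
print(contested_variants(PLACEHOLDER_SCORES))  # middle-ranked disagreement
```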

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The absence of human expert validation on the rubric means the reported quality differences could partly reflect evaluator model preferences rather than objective design merit.
The same collaboration structures might be tested on non-software tasks such as generating business requirements or technical specifications to check if the ranking pattern holds.
Adding a human-in-the-loop step to the top-performing topologies could produce further gains beyond the fully automated setting.
Repeating the experiment with a broader set of base models or larger design tasks could shift the relative performance of the twelve topologies.

Load-bearing premise

The 12-dimensional rubric applied by GPT-OSS 120B, Claude Opus 4.6, and Claude Sonnet 4.6 produces valid and unbiased assessments of software design quality without human validation.

What would settle it

A follow-up study in which human software architects rate the same set of generated designs and check whether the ranking order still places v4b first.

Figures

Figures reproduced from arXiv: 2606.01490 by Nagarjuna Kanamarlapudi, Praveen K.

**Figure 1.** Workflow topologies for all 12 variant configurations. Arrows indicate design/review/feedback flow between agents. Colors distinguish roles: …

**Figure 4.** Quality distribution by variant and task complexity (Simple, Medium, …

**Figure 3.** Quality score distribution by variant using weighted ensemble ranking.

**Figure 5.** Quality delta vs. v1 baseline by evaluator. Green indicates improve…

**Figure 6.** Cost-quality Pareto frontier (weighted ensemble ranking). Points above …

**Figure 7.** The structural adversarial mechanism. A principal architect reviewer …

**Figure 8.** Experimental variants mapped onto the 2×2×2 design space. Circle colors indicate variant family; annotations show weighted ensemble quality tier placement.

**Figure 9.** Three-evaluator cross-validation: agreement (all rank v4b #1, v3 …

Original abstract

We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a $2\times2\times2$ factorial design (Authority $\times$ Roles $\times$ Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6). We report four core findings. First, structural adversarial (v4b) ranks #1 by ensemble -- a prompt-engineered adversarial variant that demands rewrite mandates rather than patches (weighted ensemble: 4.637/5.0). Second, cross-model review wins unanimously at #2 -- generate with one model, review with another -- ranking #2 by all three evaluators (weighted ensemble: 4.606). Third, evaluator diversity is itself a finding -- all three evaluators agree v4b is best and v3 is worst, but disagree sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), revealing how different model families weight design qualities. Fourth, parallel merge is fundamentally broken -- all three evaluators place merge variants in the bottom tier (3.65-3.79), due to token starvation and the Frankenstein effect. The weighted ensemble ($2\times$Opus + $2\times$Sonnet + $1\times$GPT-OSS) provides robust rankings across 520 runs, confirmed through independent cross-validation.
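
The 2:2:1 weighting in the abstract can be read as a weighted mean of the three evaluators' scores. The sketch below assumes the weights are normalized by their sum (the abstract gives the weights but not the normalization), and the evaluator keys and example scores are illustrative placeholders rather than values from the paper.

```python
from typing import Dict

# Weights from the abstract: 2 x Opus + 2 x Sonnet + 1 x GPT-OSS.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
ENSEMBLE_WEIGHTS: Dict[str, float] = {"opus": 2.0, "sonnet": 2.0, "gpt_oss": 1.0}


def weighted_ensemble(
    scores: Dict[str, float], weights: Dict[str, float] = ENSEMBLE_WEIGHTS
) -> float:
    """Weighted mean of per-evaluator scores, normalized by total weight (assumed)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Placeholder rubric scores on the 1-5 scale (NOT from the paper).
example = {"opus": 4.0, "sonnet": 4.5, "gpt_oss": 5.0}
print(weighted_ensemble(example))  # (2*4.0 + 2*4.5 + 1*5.0) / 5
```

Under this reading the two Claude evaluators together carry 80% of the weight, so the ensemble ranking leans toward wherever they agree with each other.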

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

520-run comparison of agent topologies shows adversarial and cross-model setups ahead, but evaluation rests on unvalidated LLM rubrics.

The one or two things to know: this paper ran a controlled experiment with 520 trials on 12 different multi-agent LLM topologies for software design, using a factorial design, and reports that an adversarial variant and cross-model review performed best by their metrics while merge methods failed.

What stands out as new is the specific comparison data across those topologies, with multiple repetitions and three different evaluator models. Most work in this area is smaller or less systematic, so the run count and the observation that evaluators agree on the top and bottom but differ in the middle are useful raw information.

They handle the scale reasonably by using an ensemble score and noting the disagreements, which adds some transparency.

The soft spot is the reliance on LLM-based evaluation without any human validation. The 12-dimensional rubric is applied by GPT-OSS 120B, Claude Opus 4.6, and Claude Sonnet 4.6, but there is no mention of checking whether those scores align with expert human judgment, or of any inter-rater reliability measured against people. Since both the evaluators and the agents are LLMs, the results may capture model preferences more than actual design quality. The paper treats the ensemble as robust, but that robustness holds within the LLM assessment framework, not against external standards. The tasks and how they were chosen also aren't detailed here, which makes it hard to judge generalizability.

This paper is for people already working on or interested in multi-agent setups for code and design tasks who want some empirical pointers on what configurations to try. It won't settle debates on agent collaboration but gives numbers to build on.

I would send it to peer review. The experiment is large enough and the questions are practical enough that referees could help improve the evaluation approach and clarify the methods.

Referee Report

3 major / 1 minor

Summary. The manuscript describes a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Employing a 2×2×2 factorial design (Authority × Roles × Dynamics), it performs 520 runs across 8 tasks with 5 repetitions, scoring outputs via a 12-dimensional rubric applied by three LLM evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6). Key findings include the structural adversarial topology (v4b) ranking first with a weighted ensemble score of 4.637/5.0, cross-model review ranking second at 4.606, evaluator diversity effects, and parallel merge topologies performing poorly due to token starvation and the Frankenstein effect.
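
For readers unfamiliar with factorial designs, the 2×2×2 space can be enumerated as the Cartesian product of three two-level factors. The factor names below come from the paper; the level labels are placeholders, since the summary does not name the levels, and the mapping of the 12 variants onto these 8 cells (Figure 8) is not reproduced here.

```python
from itertools import product
from typing import Dict, List, Tuple

# Factor names from the paper; level labels are hypothetical placeholders.
FACTORS: Dict[str, Tuple[str, str]] = {
    "authority": ("authority_level_0", "authority_level_1"),
    "roles": ("roles_level_0", "roles_level_1"),
    "dynamics": ("dynamics_level_0", "dynamics_level_1"),
}


def design_cells(factors: Dict[str, Tuple[str, str]]) -> List[Dict[str, str]]:
    """Enumerate every combination of factor levels (one dict per cell)."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]


cells = design_cells(FACTORS)
print(len(cells))  # 2 * 2 * 2 cells
for cell in cells:
    print(cell)
```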

Significance. Should the LLM-based rubric prove to be a reliable proxy for software design quality, this work provides substantial empirical evidence on the effectiveness of different multi-agent collaboration structures. Strengths include the large scale of 520 runs, the factorial design allowing isolation of factors, and the use of multiple independent evaluators which highlights disagreements and provides a weighted ensemble. The findings on adversarial prompting and cross-model review could guide practical implementations in LLM-based software engineering tools.

major comments (3)

[Evaluation Methodology] The 12-dimensional rubric is applied solely by three LLMs without any reported human-expert calibration, inter-rater reliability statistics, or validation against external ground truth. This is load-bearing for all ranking claims, including v4b at 4.637/5.0 and the identification of merge variants as broken (3.65-3.79), as differences may reflect model biases rather than design quality.
[Task Selection and Experimental Design] Details on the criteria for selecting the 8 design tasks, their specific complexity levels, and how they represent varying software architecture challenges are not provided. This affects the generalizability of the topology rankings across the 520 runs.
[Results and Statistical Analysis] The paper reports rankings and some differences (e.g., d=1.44 vs. d=0.45 on v2b) but lacks a description of the statistical tests used to confirm the significance of differences between topologies, and of the methods for aggregating or handling disagreements among the three evaluators in the weighted ensemble (2×Opus + 2×Sonnet + 1×GPT-OSS).
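
The `d` values quoted in the third comment are presumably Cohen's d effect sizes (the paper's reference list includes Cohen's power-analysis text), though the exact variant is not stated. A minimal sketch of the common pooled-standard-deviation form follows; the function name and the sample numbers are illustrative and not taken from the paper.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(treatment: Sequence[float], baseline: Sequence[float]) -> float:
    """Cohen's d with a pooled sample standard deviation (assumed variant).

    Each group needs at least two observations so the sample variance exists.
    """
    n1, n2 = len(treatment), len(baseline)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(baseline)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(baseline)) / sqrt(pooled_var)


# Toy numbers chosen so the arithmetic is easy to follow:
# means 4 and 3, both sample variances 4, pooled SD 2, so d = 0.5.
print(cohens_d([2, 4, 6], [1, 3, 5]))
```

By Cohen's conventional thresholds, 0.45 sits near a medium effect and 1.44 is well beyond a large one, which is why the gap between evaluators on v2b is notable.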

minor comments (1)

[Abstract] The abstract mentions 'confirmed through independent cross-validation' but the full text should clarify what this cross-validation entails to avoid ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below, indicating planned revisions to the manuscript.

Referee: [Evaluation Methodology] The 12-dimensional rubric is applied solely by three LLMs without any reported human-expert calibration, inter-rater reliability statistics, or validation against external ground truth. This is load-bearing for all ranking claims, including v4b at 4.637/5.0 and the identification of merge variants as broken (3.65-3.79), as differences may reflect model biases rather than design quality.

Authors: We agree this is a substantive limitation. The LLM evaluators were selected to support the scale of 520 runs, which would be infeasible with human raters. In revision we will add an 'Evaluator Agreement' subsection reporting pairwise correlations and agreement metrics across the three models. We will also expand the limitations section to explicitly note the lack of human calibration and the possibility of model-specific biases influencing rankings. The weighted ensemble (2×Opus + 2×Sonnet + 1×GPT-OSS) was intended to reduce single-model bias, but we will clarify this rationale. Revision: partial.

Referee: [Task Selection and Experimental Design] Details on the criteria for selecting the 8 design tasks, their specific complexity levels, and how they represent varying software architecture challenges are not provided. This affects the generalizability of the topology rankings across the 520 runs.

Authors: We will revise the 'Experimental Tasks' section to specify selection criteria, including domain coverage (web services, distributed systems, embedded control) and complexity metrics (component count, interaction density). Brief descriptions of each task will be added to demonstrate how they span the range of architecture challenges. Revision: yes.

Referee: [Results and Statistical Analysis] The paper reports rankings and some differences (e.g., d=1.44 vs. d=0.45 on v2b) but lacks a description of the statistical tests used to confirm the significance of differences between topologies, and of the methods for aggregating or handling disagreements among the three evaluators in the weighted ensemble (2×Opus + 2×Sonnet + 1×GPT-OSS).

Authors: We will insert a 'Statistical Analysis' subsection describing the tests (repeated-measures ANOVA across tasks with post-hoc Tukey HSD and effect-size reporting) used to assess topology differences. The ensemble aggregation method will be detailed, including the rationale for the 2:2:1 weighting and how per-evaluator scores were combined. The cross-validation mentioned in the abstract will be expanded to clarify the hold-out procedure used to verify ranking stability. Revision: yes.
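
The first response promises pairwise correlations across the three evaluators without naming a statistic. One plausible choice, sketched here as an assumption rather than the authors' method, is Spearman's rank correlation computed for every evaluator pair. The implementation uses only the standard library, and the evaluator names and score lists are placeholders.

```python
from itertools import combinations
from math import sqrt
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y)))


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_agreement(scores: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Spearman's rho for every pair of evaluators over the same designs."""
    return {(a, b): spearman(scores[a], scores[b]) for a, b in combinations(scores, 2)}


# Placeholder scores for four designs (NOT from the paper).
demo = {"eval_1": [4.6, 4.2, 4.0, 3.7], "eval_2": [4.5, 3.9, 4.1, 3.6]}
print(pairwise_agreement(demo))
```

A chance-corrected measure such as Krippendorff's alpha, which the paper's reference list points to, would be a natural complement, since rank correlation alone ignores systematic offsets between evaluators.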

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of experimental outcomes

The paper describes a controlled factorial experiment with 520 runs, direct rubric scoring by three LLMs, and reporting of resulting rankings via a fixed weighted ensemble. No derivations, equations, fitted parameters, or predictions are present that could reduce outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements. The central claims are observational results from the described protocol, making the work self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on the domain assumption that automated LLM evaluators can stand in for human judgment of software design quality; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6) using a 12-dimensional rubric produce reliable rankings of software design quality.
The experiment treats the three models' scores as ground truth for comparing topologies without external validation.

Reference graph

Works this paper leans on

13 extracted references · 7 canonical work pages · 6 internal anchors

[1] T. Qin et al., “ChatDev: Communicative agents for software development,” arXiv:2307.07924, 2023.

[2] S. Hong et al., “MetaGPT: Meta programming for multi-agent collaborative framework,” arXiv:2308.00352, 2023.

[3] M. Huang et al., “AgentCoder: Multi-agent code generation with iterative testing and optimization,” arXiv:2312.13010, 2023.

[4] M. Islam et al., “MapCoder: Multi-agent code generation for competitive programming,” arXiv:2405.11403, 2024.

[5] Y. Du et al., “Improving factuality and reasoning in language models through multiagent debate,” arXiv:2305.14325, 2023.

[6] T. Liang et al., “Encouraging divergent thinking in large language models through multi-agent debate,” arXiv:2305.19118, 2023.

[7] K. Krippendorff, Content Analysis: An Introduction to Its Methodology, 4th ed. Thousand Oaks, CA: SAGE, 2019.

[8] M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance,” J. American Statistical Association, vol. 32, no. 200, pp. 675–701, 1937.

[9] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.

[10] J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.

[11] E. Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Boston, MA: Addison-Wesley, 2004.

[12] Google, “Gemini 2.5 Pro Technical Report,” 2025. [Online]. Available: https://ai.google.dev

[13] OpenAI, “GPT-4 Technical Report,” arXiv:2303.08774, 2023.
