arxiv: 2606.00308 · v1 · pith:VKNI5MPGnew · submitted 2026-05-29 · 💻 cs.SE · cs.AI · cs.LG

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

Pith reviewed 2026-06-28 21:12 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.LG
keywords multi-agent LLM · code complexity · HumanEval · RADON metrics · LLM code generation · software architecture · agent orchestration

The pith

Six LLM code-generation architectures collapse into two complexity clusters separated by a 50-130% gap, with no accuracy gain for the heavier cluster.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether common multi-agent pipelines for LLM code generation change the structural complexity of the output code. It runs six configurations across 164 HumanEval problems under two GPT-4o-family models and measures the results with five RADON metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). The architectures divide into the same two stable clusters under both models, whether all completions or only passing code is considered. The analyst-coder layer drives most of the increase in complexity, the debugger reduces it when added to an analyst-coder base, and the tester raises it again. Extra complexity does not improve pass@1 rates.
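The size of the design follows directly from these counts. A minimal sketch of the experimental grid, assuming one completion per (architecture, model, task) cell; the layer letters come from the Figure 1 caption, while the name-to-subset mapping and the model placeholders are inferred for illustration and are not taken from the paper's Table II:

```python
from itertools import product

# Layer letters follow the Figure 1 caption: R = role decomposition
# (Analyst + Coder), T = tester, D = runtime debugger.
# ASSUMPTION: this name-to-subset mapping is inferred from the
# configuration names, not copied from the paper's Table II.
CONFIGS = {
    "Basic": frozenset(),
    "AC": frozenset("R"),
    "ACT": frozenset("RT"),
    "Debugger": frozenset("D"),
    "AC+Debugger": frozenset("RD"),
    "ACT+Debugger": frozenset("RTD"),
}

# Placeholder labels for the two GPT-4o-family models (hypothetical names).
MODELS = ("model_a", "model_b")
TASKS = [f"HumanEval/{i}" for i in range(164)]

# One observation per (architecture, model, task) cell.
cells = list(product(CONFIGS, MODELS, TASKS))
print(len(cells))  # 6 * 2 * 164 = 1968

# Each (model, task) pair is one block of six paired measurements.
blocks = len(MODELS) * len(TASKS)
print(blocks)  # 328
```

Viewing each (model, task) pair as a block of six measurements is what makes paired tests applicable: every architecture is compared on the same task under the same model.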

Core claim

The six architectures collapse into two indistinguishable complexity clusters separated by a 50-130% gap, the same partition in both models and under both conditions; among the architectural layers, the analyst-coder split inflates complexity, the runtime debugger does not - and on the analyst-coder background actively deflates it - and the tester re-inflates it. The heavy cluster's additional complexity buys no pass@1 advantage: the leanest architectures match or beat the heaviest on accuracy.
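For readers unfamiliar with the accuracy measure, pass@1 can be sketched as follows. The general pass@k estimator is the standard one from the HumanEval paper; the assumption that each task has exactly one completion per architecture is inferred from the observation count and is not confirmed by the text above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples, c of them passing."""
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(passed: list) -> float:
    """pass@1 over a benchmark, ASSUMING one completion per task."""
    return sum(passed) / len(passed)


# Hypothetical outcomes for four tasks, for illustration only.
print(benchmark_pass_at_1([True, False, True, True]))  # 0.75
```

With a single sample per task, `pass_at_k(1, c, 1)` reduces to 1.0 for a pass and 0.0 for a fail, so the benchmark score is simply the fraction of tasks solved.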

What carries the argument

The paired non-parametric statistical pipeline (Friedman omnibus test followed by Wilcoxon signed-rank post-hoc tests with Holm correction) applied to RADON metrics across 1,968 paired observations from the six configurations.
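The Holm step in that pipeline can be sketched in a few lines. This is a generic implementation of the Holm step-down adjustment, not the paper's code, and the p-values are made up for illustration:

```python
def holm_adjust(p_values: list) -> list:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values from four post-hoc comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
print(holm_adjust(raw))
```

Six architectures give 15 distinct pairs, so a full post-hoc family for one metric would contain 15 Wilcoxon tests; how the paper groups comparisons into families is not stated in this summary.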

If this is right

  • Architectural layers in LLM code systems should be added only when they produce measurable gains on the target dimension rather than assumed to help.
  • The analyst-coder split is the primary driver of elevated complexity across the tested setups.
  • Adding a runtime debugger to an analyst-coder base can lower measured complexity.
  • Adding a tester layer on top of the analyst-coder split raises complexity again.
  • Complexity differences do not translate into functional accuracy differences on the tested tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The clustering pattern may hold for other code-generation benchmarks beyond HumanEval.
  • Teams building production multi-agent code systems could default to the leaner architectures to reduce unnecessary code complexity.
  • Future studies might check whether the same layer effects appear when using different complexity metrics or open-source models.
  • The results suggest that correctness-focused evaluations alone are insufficient for choosing among agent pipelines.

Load-bearing premise

The RADON metrics together with the paired statistical tests isolate the effect of architecture on complexity without being confounded by model biases or the specific choice of the 164 HumanEval tasks.

What would settle it

Repeating the experiment on a fresh benchmark set and finding either that the two clusters disappear or that the heavier cluster produces reliably higher pass@1 rates would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.00308 by Nazmus Ashrafi.

Figure 1: The ACT+Debugger pipeline as the union of three architectural layers: R (role decomposition: Analyst + Coder), T (testing with static LLM-based code review and bounded iteration), and D (runtime debugging with block-wise execution feedback and repair loop). Each of the six configurations is a subset of {R, T, D} (Table II); Basic = ∅. Solid arrows show data flow; dashed arrows show iteration loops. Notatio…

Figure 2: Distributions of the five complexity metrics across the six generation architectures, for both models (all-completions, …

Figure 3: The cluster gap on a single task: HumanEval/35 (max_element), on which all six architectures produced passing code under gpt-4o-mini. All three lean-cluster architectures emit a byte-identical two-line Pythonic solution; all three heavy-cluster architectures likewise emit a byte-identical ten-line manual re-implementation with type-checking, an empty-list guard, and an explicit loop. Same task, same correc…

Figure 4: Matched-pairs rank-biserial correlation for all …

Figure 5: Friedman mean-rank diagram (SLOC, all-completions). Architectures sit on the rank axis at their mean rank (lower …

Figure 6: Median complexity profiles across the six architectures, with both models overlaid (all-completions). The two-cluster shape is reproduced by every …

Figure 7: Mean SLOC per cell against pass@1, for the six architectures under each model (all-completions). Higher architectural complexity does not correspond …
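The contrast described in the Figure 3 caption can be illustrated with a hypothetical pair of solutions. These are not the byte-identical outputs from the paper, which are not reproduced on this page; they are a sketch written to match the caption's description (a two-line solution versus a ten-line version with type-checking, an empty-list guard, and an explicit loop), and the function names are invented for the comparison:

```python
def max_element_lean(l: list):
    # Two lines: delegate to the built-in.
    return max(l)


def max_element_heavy(l: list):
    # Ten lines: manual re-implementation with defensive checks.
    if not isinstance(l, list):
        raise TypeError("input must be a list")
    if not l:
        raise ValueError("list is empty")
    largest = l[0]
    for item in l[1:]:
        if item > largest:
            largest = item
    return largest


sample = [5, 3, -5, 2, -3, 3, 9, 0, 123, 1, -10]
print(max_element_lean(sample), max_element_heavy(sample))  # 123 123
```

Both versions return the same result on valid input, which is the point of the figure: the extra branches and the loop raise SLOC and cyclomatic complexity without changing whether the task passes.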
Original abstract

Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and debugger pipelines - and is evaluated almost exclusively on functional correctness. Whether these architectures also affect the structural complexity of the code they produce, and which orchestration layers carry the cost, remains largely unexamined: prior work has documented prompt-level effects on code complexity, but the architecture-level question is open. We compare six widely-used multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) under two models from the GPT-4o family across all 164 HumanEval tasks - 1,968 paired observations - using the five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). We apply a paired non-parametric statistical pipeline (Friedman omnibus, Wilcoxon signed-rank post-hoc with Holm correction, Kendall's $W$ and matched-pairs rank-biserial effect sizes) in both all-completions and passing-only conditions. The six architectures collapse into two indistinguishable complexity clusters separated by a 50-130% gap, the same partition in both models and under both conditions; among the architectural layers, the analyst-coder split inflates complexity, the runtime debugger does not - and on the analyst-coder background actively deflates it - and the tester re-inflates it. The heavy cluster's additional complexity buys no pass@1 advantage: the leanest architectures match or beat the heaviest on accuracy. Architectural elaboration in LLM code generation should therefore be justified by measured benefit on the dimensions that matter, not assumed.
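Three of the five metrics named in the abstract are Halstead measures. A minimal sketch of the textbook Halstead formulas, given operator and operand counts; the counts below are invented, and the way Radon extracts operators and operands from Python source may differ from a hand count:

```python
from math import log2


def halstead(n1: int, n2: int, N1: int, N2: int) -> dict:
    """Textbook Halstead measures.

    n1, n2: number of distinct operators / distinct operands
    N1, N2: total occurrences of operators / operands
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * log2(vocabulary)      # V = N * log2(n)
    difficulty = (n1 / 2) * (N2 / n2)       # D = (n1 / 2) * (N2 / n2)
    effort = difficulty * volume            # E = D * V
    return {"volume": volume, "difficulty": difficulty, "effort": effort}


# Hypothetical counts for a small function.
print(halstead(n1=2, n2=6, N1=4, N2=8))
```

Because Effort is the product of Difficulty and Volume, the three Halstead metrics are strongly related, which is worth keeping in mind when reading five metrics as five independent pieces of evidence.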

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper reports a paired empirical comparison of six multi-agent LLM code-generation architectures (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) on all 164 HumanEval tasks using two GPT-4o-family models, yielding 1,968 observations. Complexity is measured with the five RADON metrics (SLOC, cyclomatic complexity, Halstead Volume/Difficulty/Effort). A non-parametric pipeline (Friedman omnibus, Holm-corrected Wilcoxon signed-rank, Kendall's W, matched-pairs rank-biserial effect sizes) is applied in both the all-completions and passing-only conditions. The architectures partition into two stable complexity clusters separated by a 50-130% gap; the analyst-coder layer increases complexity, the debugger reduces it on an AC background, and the tester increases it. The heavier cluster confers no pass@1 advantage.

Significance. If the reported cluster separation and layer-wise effects hold, the work supplies concrete evidence that architectural elaboration in multi-agent LLM systems can substantially raise structural complexity without improving functional correctness. The paired design across 1,968 observations, dual models, and both completion conditions strengthens internal validity. Explicit credit is due for the use of standard non-parametric tests with multiplicity correction and for reporting both all-completions and passing-only analyses. The result supplies a falsifiable, architecture-level claim that can be tested on other benchmarks or models.

minor comments (3)
  1. [§3] §3 (Methods): the precise prompt templates and hand-off protocols for each of the six architectures are referenced but not reproduced; including them (or a pointer to a public repository) would improve replicability.
  2. [§4.2] §4.2 (Statistical pipeline): the exact rule used to filter completions for the passing-only condition (e.g., whether syntax errors or runtime failures are excluded before or after RADON measurement) is not stated; this detail is needed to evaluate possible selection bias.
  3. [Table 2] Table 2 / Figure 3: the reported 50-130% gap should be accompanied by the specific pairwise rank-biserial effect sizes and the architectures assigned to each cluster so that readers can verify the partition without re-running the analysis.
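The effect size requested in comment 3 is straightforward to compute from paired measurements. A minimal sketch of the matched-pairs rank-biserial correlation using the simple difference formula; dropping zero differences is an assumption here, since the paper's handling of ties is not described on this page, and the sample values are invented:

```python
def matched_pairs_rank_biserial(x: list, y: list) -> float:
    """Matched-pairs rank-biserial correlation (simple difference formula).

    Positive values mean x tends to exceed y.
    ASSUMPTION: pairs with zero difference are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    if not diffs:
        return 0.0
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    favorable = sum(r for r, d in zip(ranks, diffs) if d > 0)
    unfavorable = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (favorable - unfavorable) / (favorable + unfavorable)


# Hypothetical SLOC values for two architectures on the same four tasks.
heavy = [10, 12, 9, 15]
lean = [5, 6, 9, 20]
print(matched_pairs_rank_biserial(heavy, lean))  # 0.5
```

A value of 1.0 would mean the first architecture is larger on every non-tied task; values near 0 mean no consistent direction, which is what a within-cluster comparison would be expected to show.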

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and detailed summary of our work, the recognition of its internal validity, and the recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

Empirical measurement study with no derivational chain

The paper is a purely empirical paired study applying standard non-parametric tests (Friedman omnibus, Holm-corrected Wilcoxon, Kendall's W, rank-biserial effect sizes) to RADON metrics on 1,968 observations from HumanEval. No equations, fitted parameters, ansatzes, or self-citations appear in the reported pipeline or conclusions; the two-cluster partition, layer-wise effects, and accuracy comparisons are direct statistical outputs from the data rather than reductions of any claimed derivation to its own inputs. The methods are externally verifiable and do not rely on prior author work for uniqueness or load-bearing premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical comparison relying on established code complexity metrics (RADON) and standard non-parametric statistical tests; no free parameters, invented entities, or ad-hoc axioms introduced.

axioms (1)
  • domain assumption: Standard assumptions of the Friedman omnibus test and the Wilcoxon signed-rank test with Holm correction hold for the paired observations across architectures.
    Invoked to support the cluster identification and post-hoc comparisons in both all-completions and passing-only conditions.
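To make the omnibus step concrete, the Friedman statistic and Kendall's W can be computed from within-block ranks. A minimal sketch without the tie correction that statistical packages usually apply; the block layout (one row per task, one column per architecture) is an assumption about how the data would be arranged, and the numbers are invented:

```python
def average_ranks(values: list) -> list:
    """1-based ranks within one block, with tied values sharing an average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def friedman_and_kendall_w(blocks: list) -> tuple:
    """Friedman chi-square (no tie correction) and Kendall's W.

    blocks: n rows (tasks), each holding k measurements (architectures).
    """
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for block in blocks:
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    w = chi2 / (n * (k - 1))  # W rescales chi-square to the range [0, 1]
    return chi2, w


# Hypothetical data: four tasks, three architectures, identical ordering in every task.
print(friedman_and_kendall_w([[1, 2, 3]] * 4))  # (8.0, 1.0)
```

The example shows the upper limit: when every task ranks the architectures the same way, W equals 1. Turning the chi-square statistic into a p-value requires the chi-square distribution with k - 1 degrees of freedom, which is omitted here.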

pith-pipeline@v0.9.1-grok · 5830 in / 1297 out tokens · 30626 ms · 2026-06-28T21:12:08.755555+00:00 · methodology
