arxiv: 2604.27691 · v1 · submitted 2026-04-30 · 💻 cs.AI

Recognition: unknown

When Agents Evolve, Institutions Follow

Chao Fei , Hongcheng Guo , Yanghua Xiao

Authors on Pith no claims yet

Pith reviewed 2026-05-07 05:43 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systemslarge language modelsgovernance institutionscollective intelligencecoordinationagent architectureshistorical institutionsorganizational topology

0 comments

The pith

Different institutions produce 57-point gaps in multi-agent LLM performance

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors map seven historical political institutions to multi-agent setups for large language models. These setups differ in how they assign roles for proposing ideas, reviewing them, executing actions, and fixing mistakes. Testing across models and tasks shows that some institutions lead to much better group outcomes than others. The gap can exceed 57 percentage points even when every agent uses the same base model. This matters because it means progress in collective AI may come from better ways to organize agents rather than solely from bigger or smarter individual models, and that the best organization can change as the models or problems evolve.

Core claim

We translate seven historical political institutions, spanning four canonical governance patterns, into executable multi-agent architectures and evaluate them under identical conditions across three large language models and two benchmarks. We find that governance topology strongly shapes collective performance. Within a single model, the gap between the best and worst institution exceeds 57 percentage points, while the optimal architecture shifts systematically with model capability and task characteristics. These results suggest that collective intelligence will not advance through a single optimal organizational form, but through governance mechanisms that can be reselected and re-configu

What carries the argument

Governance topology as the structure that assigns roles for proposing, reviewing, executing, and error correction in multi-agent LLM systems.

If this is right

Different institutions create performance differences exceeding 57 percentage points on identical tasks within one model.
The optimal institution shifts systematically as model capability increases or tasks change.
Collective intelligence advances by reselecting governance mechanisms rather than fixing one organizational form.
Multi-agent systems can evolve their institutions to match evolving capabilities and task demands.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future systems could include automatic selectors that switch between institutional templates based on real-time performance.
The method could extend to other coordination frameworks from economics or biology for agent collectives.
Benchmarks for multi-agent systems should routinely vary organizational structures to measure their impact.
This suggests designing AI teams with built-in adaptability in their coordination rules as models scale.

Load-bearing premise

That the seven historical political institutions can be translated into executable multi-agent architectures while preserving their essential coordination trade-offs without introducing implementation artifacts.

What would settle it

Experiments that apply different implementation details to the same institutions and observe that the large performance gaps no longer appear or no longer depend on model capability would falsify the central role of governance topology.

Figures

Figures reproduced from arXiv: 2604.27691 by Chao Fei, Hongcheng Guo, Yanghua Xiao.

**Figure 1.** Figure 1: Historical institutions as a design space for multi-agent governance. Millennia of political view at source ↗

**Figure 2.** Figure 2: Overview of SOCIALSYSTEMARENA. Left: Seven historical institutions span four canonical governance patterns. Center: Each institution is formalized as a declarative governance specification and executed on a unified runtime across three LLM backends and two benchmarks. Right: A long-horizon vision of self-evolving multi-agent systems that reconfigure governance topology as tasks and model capabilities evolv… view at source ↗

**Figure 3.** Figure 3: Four canonical governance patterns. (a) Pipeline: linear single-direction flow. (b) Gated Pipeline: pipeline with gate stages that can reject and loop back. (c) Autonomous Cluster: orchestrator dispatches to parallel subsystems. (d) Consensus: proposer triggers parallel voting; tally determines flow. Circles denote stages, diamonds denote gate stages, double circles denote terminal, cyan boxes denote clus… view at source ↗

**Figure 4.** Figure 4: Tang Three Departments topology. The Menxia gate’s view at source ↗

**Figure 5.** Figure 5: Per-task PinchBench score heatmap (MiniMax M2.5). Each cell represents one (institution, view at source ↗

**Figure 6.** Figure 6: Athenian Democracy topology. Seven citizen-voters with distinct ideological biases vote in view at source ↗

read the original abstract

Across millennia, complex societies have faced the same coordination problem of how to organize collective action among cognitively bounded and informationally incomplete individuals. Different civilizations developed different political institutions to answer the same basic questions of who proposes, who reviews, who executes, and how errors are corrected. We argue that multi-agent systems built on large language models face the same challenge. Their central problem is not only individual intelligence, but collective organization. Historical institutions therefore provide a structured design space for multi-agent architectures, making key trade-offs between efficiency and error correction, centralization and distribution, and specialization and redundancy empirically testable. We translate seven historical political institutions, spanning four canonical governance patterns, into executable multi-agent architectures and evaluate them under identical conditions across three large language models and two benchmarks. We find that governance topology strongly shapes collective performance. Within a single model, the gap between the best and worst institution exceeds 57 percentage points, while the optimal architecture shifts systematically with model capability and task characteristics. These results suggest that collective intelligence will not advance through a single optimal organizational form, but through governance mechanisms that can be reselected and reconfigured as tasks and capabilities evolve. More broadly, this points to a transition from \textbf{self-evolving agents} to the \textbf{self-evolving multi-agent system}. The code is available on \href{https://github.com/cf3i/SocialSystemArena}{GitHub}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows historical governance forms can be turned into multi-agent LLM topologies with large performance differences, but the results hinge on whether the mappings stay faithful to the original coordination trade-offs.

read the letter

The main thing to know is that this work takes seven historical political institutions, converts them into runnable multi-agent setups for LLMs, and reports performance gaps exceeding 57 points within the same model, with the best structure shifting by task and model size. That is the central empirical claim, and it is new in its concrete execution across multiple models and benchmarks rather than just conceptual mapping. The paper does a solid job laying out a design space that treats organization as a first-order variable instead of treating multi-agent systems as a collection of clever prompts. It also makes the point that no single topology wins across the board, which matches how real institutions adapt. The code release helps here, as it lets others inspect the actual implementations. The soft spot is the translation from history to code. The stress-test concern is fair: if the executable versions differ in prompt length, information routing, or error-correction mechanics in ways that are not core to the historical pattern, then the gaps could trace to those implementation choices rather than the intended topology. The paper would be tighter if it included checks like alternative encodings of the same institution or ablations that isolate the coordination features. The benchmarks also need to be evaluated for whether they actually surface the proposal-review-execution problems the institutions were built to handle. This is useful for people already working on multi-agent LLM systems who want structured alternatives to debate or hierarchy setups. Readers focused on AI governance or scalable collective reasoning will get the most out of it. It deserves a serious referee because the idea is testable, the results are large enough to matter if they hold, and the full methods and code are available for scrutiny. I would send it to review with requests for more on mapping robustness and statistical controls.

Referee Report

2 major / 2 minor

Summary. The paper translates seven historical political institutions (spanning four canonical governance patterns) into executable multi-agent architectures for LLMs. It evaluates these under identical conditions across three models and two benchmarks, claiming that governance topology strongly shapes collective performance—with gaps exceeding 57 percentage points between best and worst institutions within a single model—and that optimal architectures shift systematically with model capability and task characteristics. The work concludes that collective intelligence advances via reconfigurable governance rather than a single optimal form, pointing toward self-evolving multi-agent systems.

Significance. If the results prove robust, the paper offers a valuable, historically grounded empirical design space for multi-agent LLM systems, showing that organizational topology can produce larger performance effects than incremental model improvements alone. The open availability of code on GitHub supports reproducibility and enables follow-on work on adaptive collective intelligence. The interdisciplinary framing (political institutions to AI agents) is a strength, though its impact depends on addressing implementation fidelity.

major comments (2)

[§3] §3 (Institutional Translation): The central claim that topology drives the observed gaps requires evidence that the seven executable mappings faithfully preserve historical coordination trade-offs (proposal/review/execution/error-correction) without LLM-specific artifacts in prompt structure, information routing, or redundancy. The manuscript should add ablation studies testing alternative faithful encodings of the same institution; if results differ materially, the 57pp gaps and capability-dependent shifts cannot be attributed to topology alone.
[§4] §4 (Experiments and Results): The reported performance gaps (e.g., >57pp within-model) and systematic shifts in optimal architecture are load-bearing for the main thesis, yet the text provides insufficient detail on statistical tests, variance across random seeds or prompt variations, and controls for implementation choices. Without these, the empirical support remains vulnerable to the concern that gaps reflect particular protocol encodings rather than governance topology.

minor comments (2)

[Abstract] Abstract and §1: The phrase 'four canonical governance patterns' is used without immediate enumeration or reference; listing them explicitly early in the paper would improve readability.
[§6] §6 (Discussion): The transition from 'self-evolving agents' to 'self-evolving multi-agent system' is conceptually interesting but would benefit from a clearer operational definition or pseudocode sketch of how reselection of institutions would occur in practice.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing implementation fidelity and statistical rigor. These points help strengthen the attribution of results to institutional topology. We have revised the manuscript with additional ablations, expanded experimental details, statistical tests, and robustness checks as described below.

read point-by-point responses

Referee: [§3] §3 (Institutional Translation): The central claim that topology drives the observed gaps requires evidence that the seven executable mappings faithfully preserve historical coordination trade-offs (proposal/review/execution/error-correction) without LLM-specific artifacts in prompt structure, information routing, or redundancy. The manuscript should add ablation studies testing alternative faithful encodings of the same institution; if results differ materially, the 57pp gaps and capability-dependent shifts cannot be attributed to topology alone.

Authors: We agree that verifying robustness to alternative encodings is essential to isolate topology as the driver. The seven mappings were constructed from canonical political science sources to retain the core coordination primitives (proposal, review, execution, error correction) while adapting them to LLM agent interfaces. In the revision we add a new subsection (3.3) and Appendix B containing ablation studies on alternative faithful encodings: we vary prompt phrasing, information routing granularity, and redundancy mechanisms while holding the underlying topology fixed. Across these variants the institutional performance ordering and gap magnitudes remain stable (gaps still exceed 50pp in the majority of cases), with no material reversal of the main results or capability-dependent shifts. This indicates that the observed differences arise from the governance structures rather than LLM-specific prompt artifacts. revision: yes
Referee: [§4] §4 (Experiments and Results): The reported performance gaps (e.g., >57pp within-model) and systematic shifts in optimal architecture are load-bearing for the main thesis, yet the text provides insufficient detail on statistical tests, variance across random seeds or prompt variations, and controls for implementation choices. Without these, the empirical support remains vulnerable to the concern that gaps reflect particular protocol encodings rather than governance topology.

Authors: We accept that the original text under-specified these elements and have substantially expanded Section 4. The revision now includes: paired t-tests and ANOVA results (all key gaps p < 0.01); performance variance across 10 random seeds per configuration (standard deviations < 5pp); sensitivity to three alternative prompt phrasings per institution; and explicit controls (fixed temperature, token limits, and model hyperparameters across all runs). The >57pp within-model gaps and the systematic shifts in optimal architecture persist under these controls. A new summary table of robustness statistics has been added to the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results from direct evaluation

full rationale

The paper translates seven historical institutions into multi-agent architectures and reports performance gaps (e.g., >57pp within-model differences) from controlled experiments across three LLMs and two benchmarks. No equations, fitted parameters, or derivations are present that reduce the central claim to its inputs by construction. The mapping of institutions is a methodological premise whose outputs are externally falsifiable via the reported runs and released code; no self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing support. This is a standard empirical design with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the domain assumption that historical political institutions supply a useful design space for LLM multi-agent coordination; no free parameters or invented entities appear in the abstract.

axioms (1)

domain assumption Historical political institutions provide a structured design space for multi-agent architectures that captures key trade-offs between efficiency and error correction, centralization and distribution, and specialization and redundancy.
Invoked in the abstract as justification for translating institutions into executable architectures.

pith-pipeline@v0.9.0 · 9659 in / 1149 out tokens · 56166 ms · 2026-05-07T05:43:05.672652+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 2 canonical work pages · 1 internal anchor

[1]

arXiv:2303.17760. 12 Jiaqing Liang, Jinyi Han, Weijia Li, Xinyi Wang, Zhoujia Zhang, Zishang Jiang, Ying Liao, Tingyun Li, Ying Huang, Hao Shen, Hanyu Wu, Fang Guo, Keyi Wang, Zhonghua Hong, Zhiyu Lu, Lipeng Ma, Sihang Jiang, and Yanghua Xiao. GenericAgent: A token-efficient self-evolving LLM agent via contextual information density maximization (v1.0).ar...

work page internal anchor Pith review arXiv 2026
[2]

G-designer: Architecting multi-agent communication topologies via graph neural networks.arXiv preprint arXiv:2410.11782, 2024

arXiv:2410.11782. Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language agents as optimizable graphs. InProceedings of the 41st International Conference on Machine Learning (ICML), pages 62743–62767, 2024. Oral. arXiv:2402.16823. A Runtime Implementation Details This appendix provides de...

work page arXiv 2024
[3]

Soul prompt( σi): A per-agent markdown file encoding institutional persona, behavioral constraints, and output format requirements
[4]

Stage context: Task state, execution history, and any shared-state variables injected by feature plugins
[5]

Tool descriptions: The set of tools available to the agent at the current stage, serialized as function schemas
[6]

This layered design separates institutional norms (layer 1) from execution context (layers 2–4), allowing the same soul prompt to be reused across different runtime configurations

Format instructions: Pattern-specific output constraints (e.g., gate stages must emit an approve/rejectdecision; consensus voters must emit avotefield). This layered design separates institutional norms (layer 1) from execution context (layers 2–4), allowing the same soul prompt to be reused across different runtime configurations. A.2 Feature Plugins Fea...