Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration

Guohao Li; James Zou; Jie Chen; Jie Peng; Pingzhi Li; Sukwon Yun; Tianlong Chen; Wendong Fan

arxiv: 2604.17148 · v1 · submitted 2026-04-18 · 💻 cs.AI


Sukwon Yun, Jie Peng, Pingzhi Li, Wendong Fan, Jie Chen, James Zou, Guohao Li, Tianlong Chen

Pith reviewed 2026-05-10 06:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent LLM · graph neural network · agent selection · message passing · model cards · collaborative reasoning · benchmark evaluation


The pith

A graph connecting three selected LLMs via response relevance edges outperforms baselines that run all six agents at once on MMLU, MATH, and other benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a graph-based method to coordinate multiple LLMs by first picking the three most relevant models using their model cards, then building directed edges that rank how much one response can improve another. Messages then flow from stronger responses to weaker ones and back again to refine the originals before a final pooling step produces the answer. This structured selection and passing process yields higher accuracy than prior multi-agent setups that activate every available model simultaneously. A reader would care because the growing number of LLMs makes brute-force use of all of them inefficient, and a lighter graph approach could scale better while still boosting results on general and specialized tasks.

Core claim

Graph-of-Agents selects three agents from a pool of six using model-card summaries of domain and task fit, then constructs directed edges by comparing each pair of responses to establish a relevance ordering. It performs forward message passing from higher-relevance agents to lower-relevance ones, followed by reverse passing to refine the stronger responses. Finally, it aggregates the updated answers through graph pooling (such as max or mean) into a single output that exceeds the accuracy of full-pool baselines on MMLU, MMLU-Pro, GPQA, MATH, HumanEval, and MedMCQA.

What carries the argument

Graph-of-Agents framework: model-card node sampling to choose three agents, response-comparison edge construction to order relevance, bidirectional directed message passing along those edges, and graph pooling to combine final responses.
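The four stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: agents are plain callables, `card_scores` stands in for whatever model-card matching produces, `relevance` is a hypothetical scorer for a response, and "max pooling" is read here as returning the highest-scoring final response. All function names are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in: an "agent" is any callable mapping a prompt to a response.
Agent = Callable[[str], str]
Edge = Tuple[str, str]


def sample_nodes(card_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Node sampling: keep the k agents with the best model-card fit."""
    return sorted(card_scores, key=lambda name: card_scores[name], reverse=True)[:k]


def build_edges(responses: Dict[str, str],
                relevance: Callable[[str], float]) -> List[Edge]:
    """Edge construction: point from each more relevant agent to each less relevant one."""
    order = sorted(responses, key=lambda name: relevance(responses[name]), reverse=True)
    return [(order[i], order[j])
            for i in range(len(order)) for j in range(i + 1, len(order))]


def message_pass(agents: Dict[str, Agent], responses: Dict[str, str],
                 edges: List[Edge], question: str) -> Dict[str, str]:
    """Each edge target rewrites its answer after reading its sources' answers."""
    incoming: Dict[str, List[str]] = {}
    for src, dst in edges:
        incoming.setdefault(dst, []).append(responses[src])
    updated = dict(responses)  # synchronous update: all reads use the old responses
    for dst, peers in incoming.items():
        prompt = (f"Question: {question}\n"
                  f"Your answer: {responses[dst]}\n"
                  f"Peer answers: {' | '.join(peers)}")
        updated[dst] = agents[dst](prompt)
    return updated


def graph_of_agents(question: str, agents: Dict[str, Agent],
                    card_scores: Dict[str, float],
                    relevance: Callable[[str], float], k: int = 3) -> str:
    chosen = sample_nodes(card_scores, k)
    responses = {name: agents[name](question) for name in chosen}
    edges = build_edges(responses, relevance)
    responses = message_pass(agents, responses, edges, question)          # forward pass
    reverse_edges = [(dst, src) for src, dst in edges]
    responses = message_pass(agents, responses, reverse_edges, question)  # reverse pass
    return max(responses.values(), key=relevance)                         # assumed max pooling
```

Because only the `k` sampled agents are ever called, the number of model invocations in this sketch depends on `k` rather than on the size of the pool.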

If this is right

Performance gains hold even when the agent pool grows larger, because only a small relevant subset needs to be activated.
Structured message passing replaces ad-hoc prompting or simultaneous querying, reducing both compute and coordination overhead.
The same graph can be reused across tasks by updating only the response edges while keeping the sampled nodes fixed.
Pooling after bidirectional refinement produces a single answer without requiring a separate judge model.
The approach separates agent selection from communication, allowing independent improvement of either stage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If response-based edges prove noisy on open-ended generation tasks, replacing them with embedding similarity or external fact-check scores could stabilize the graph.
The method implies that LLM collaboration benefits more from selective connectivity than from sheer numbers, which could be tested by varying pool size while fixing the three-node subgraph.
Extending the same directed passing and pooling to non-LLM agents such as code interpreters or search tools would test whether the graph structure generalizes beyond language models.

Load-bearing premise

Model cards accurately reflect each LLM's specialization so the three chosen nodes are the right ones, and comparing raw responses produces reliable relevance orderings that improve answers rather than introduce new errors.

What would settle it

On a fresh benchmark outside the reported set, either of two outcomes would undercut the claim: running Graph-of-Agents with its full six-agent pool produces higher accuracy than the three-agent version, or random selection of any three agents matches the reported performance gains.
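The random-selection check can be made exhaustive, since a pool of six has only 20 three-agent subsets. The sketch below assumes a hypothetical `accuracy` callable that runs the full pipeline on a given subset and returns benchmark accuracy; the helper name and return fields are illustrative.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple


def subset_control(pool: Sequence[str],
                   chosen: Tuple[str, ...],
                   accuracy: Callable[[Tuple[str, ...]], float]) -> Dict[str, float]:
    """Compare the selected subset against every same-size subset of the pool.

    Assumes agent names in `pool` are unique and `chosen` is drawn from `pool`.
    """
    k = len(chosen)
    scores = {subset: accuracy(subset) for subset in combinations(pool, k)}
    # combinations() yields tuples in pool order, so normalise `chosen` the same way.
    chosen_key = tuple(sorted(chosen, key=list(pool).index))
    chosen_score = scores[chosen_key]
    others = [s for subset, s in scores.items() if subset != chosen_key]
    return {
        "n_subsets": len(scores),
        "chosen": chosen_score,
        "mean_other": mean(others),
        "best_other": max(others),
        "rank": 1 + sum(s > chosen_score for s in others),  # 1 means no subset beats it
    }
```

If the model-card subset ranks near the middle of the 20, the gains would be better explained by using any three agents than by the selection step. When each evaluation is expensive, sampling a handful of subsets instead of enumerating all of them is a reasonable compromise.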

Original abstract

With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra-agent communication, and (3) integrating responses efficiently. In this work, we propose Graph-of-Agents (GoA), a new graph-based framework for modeling multi-agent LLM communication. Our approach begins with node sampling, selecting only the most relevant agents by leveraging model cards that summarize each model's domain, task specialization, and other characteristics. Next, we construct edges between the selected agents by evaluating their responses against one another to determine relevance ordering. Directed message passing is then performed from highly relevant agents to less relevant ones to enhance their responses, followed by reverse message passing to refine the original responses of the more relevant agents. Finally, the updated responses are aggregated via graph-based pooling (e.g., max or mean pooling) to produce a single, unified answer. We evaluate GoA on diverse multi-domain benchmarks (MMLU, MMLU-Pro, GPQA) and domain-specific benchmarks (MATH, HumanEval, MedMCQA), with an agent pool of 6 LLMs spanning multiple domains. Surprisingly, GoA achieves superior performance using only 3 selected agents, outperforming recent multi-agent LLM baselines that utilize all 6 agents simultaneously. By adopting a graph structure, GoA offers both scalability and effectiveness through structured message passing, positioning it as a strong candidate for navigating the challenges of the ever-growing LLM zoo. Code is available at: https://github.com/UNITES-Lab/GoA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes Graph-of-Agents (GoA), a graph-based framework for multi-agent LLM collaboration. It begins with node sampling to select the most relevant agents from a pool of 6 LLMs using model cards describing domain and task specialization. Directed edges are then constructed by using LLMs to evaluate pairwise response relevance and impose an ordering. Directed message passing flows from high-relevance to low-relevance nodes to enhance responses, followed by reverse refinement, with final aggregation via graph pooling (max or mean). Evaluated on MMLU, MMLU-Pro, GPQA, MATH, HumanEval, and MedMCQA, GoA claims superior performance using only 3 selected agents compared to recent multi-agent baselines that use all 6 agents simultaneously. Code is released.

Significance. If the empirical claims hold after validation, GoA offers a scalable alternative to flat multi-agent orchestration by enabling selective participation and structured, directed communication. This addresses key limitations in frameworks like Mixture-of-Agents regarding agent selection and intra-agent information flow. The graph formulation and public code are strengths that could support follow-on work on dynamic LLM ensembles.

major comments (3)

[§3.2] §3.2 (Edge Construction via Response Relevance): The directed message-passing pipeline rests on LLM-generated pairwise relevance ordering to build edges. No ablation (e.g., random ordering or no ordering) or external validation (inter-annotator agreement with human raters on ordering quality) is reported. This is load-bearing for the central claim that the graph structure improves answers rather than merely inheriting gains from node sampling.
[§4] §4 (Experiments and Results): The headline result—superior performance with 3 agents over 6-agent baselines—is stated without specific accuracy numbers, standard deviations, or implementation details for the baselines (e.g., exact MoA configuration). Tables or figures reporting these metrics are needed to assess effect size and statistical reliability.
[§3.1] §3.1 (Node Sampling): The assumption that model cards accurately capture task-relevant specialization for effective sampling is stated but not tested (e.g., via an oracle or human-verified relevance ranking of the 6 models per benchmark). This choice directly affects which 3 agents are selected and therefore the reported gains.

minor comments (3)

[§3.3] The description of graph-based pooling (max/mean) in §3.3 would benefit from explicit pseudocode or a small worked example showing how node embeddings are combined into the final answer.
[Figure 1] Figure 1 (framework overview) should label the direction of the two message-passing phases and the final aggregation step for clarity.
[Related Work] A few citations to prior graph-based multi-agent or LLM routing work appear missing in the related-work section; adding them would better situate the contribution.
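As an illustration of the kind of worked example the first minor comment asks for, here is one possible reading of max and mean pooling over final responses. It assumes each node ends with a multiple-choice answer and a scalar relevance score; the paper may pool differently (for instance over text or embeddings), and the names and scores below are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (agent name, final answer option, relevance score); all values are illustrative.
Node = Tuple[str, str, float]


def max_pool(nodes: List[Node]) -> str:
    """Return the answer held by the single highest-scoring node."""
    return max(nodes, key=lambda node: node[2])[1]


def mean_pool(nodes: List[Node]) -> str:
    """Average each option's score over all nodes, then return the best option."""
    totals: Dict[str, float] = defaultdict(float)
    for _, answer, score in nodes:
        totals[answer] += score
    means = {answer: total / len(nodes) for answer, total in totals.items()}
    return max(means, key=lambda answer: means[answer])


nodes = [("agent_1", "B", 0.9), ("agent_2", "C", 0.6), ("agent_3", "C", 0.5)]
print(max_pool(nodes))   # B: agent_1 has the top score
print(mean_pool(nodes))  # C: 1.1/3 for C beats 0.9/3 for B
```

The two rules disagree on this input, which is exactly the case a worked example in the paper would need to cover: max pooling trusts the single best node, while mean pooling rewards agreement among nodes.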

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our Graph-of-Agents framework. We value the emphasis on isolating the contributions of each component and providing clearer experimental details. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.


Referee: [§3.2] §3.2 (Edge Construction via Response Relevance): The directed message-passing pipeline rests on LLM-generated pairwise relevance ordering to build edges. No ablation (e.g., random ordering or no ordering) or external validation (inter-annotator agreement with human raters on ordering quality) is reported. This is load-bearing for the central claim that the graph structure improves answers rather than merely inheriting gains from node sampling.

Authors: We agree that ablations are essential to substantiate the benefit of the relevance-based directed edges. In the revised manuscript we will add a dedicated ablation subsection comparing (i) our LLM-generated relevance ordering, (ii) random ordering of the same nodes, and (iii) a no-ordering baseline that performs simultaneous aggregation without message passing. These experiments will be run on the same benchmarks and agent pool. Regarding external validation, a full inter-annotator agreement study with human raters was not performed in the original work owing to annotation cost and scale; we will instead report consistency statistics across multiple LLM judges and add a qualitative discussion of ordering quality, together with a small-scale human spot-check on a subset of examples if space allows.

revision: partial

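One simple consistency statistic for the promised multi-judge comparison is the fraction of response pairs that two judges order the same way. The sketch below is an assumption about what such a statistic could look like, not a measure the authors specify; it expects both orderings to contain the same agent names, most relevant first.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(order_a: Sequence[str], order_b: Sequence[str]) -> float:
    """Fraction of agent pairs ranked in the same relative order by two judges."""
    pos_a = {name: i for i, name in enumerate(order_a)}
    pos_b = {name: i for i, name in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    agree = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return agree / len(pairs)
```

With three agents there are only three pairs, so the statistic takes values in {0, 1/3, 2/3, 1}; it becomes informative only when averaged over many questions.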
Referee: [§4] §4 (Experiments and Results): The headline result—superior performance with 3 agents over 6-agent baselines—is stated without specific accuracy numbers, standard deviations, or implementation details for the baselines (e.g., exact MoA configuration). Tables or figures reporting these metrics are needed to assess effect size and statistical reliability.

Authors: We apologize for the lack of explicit numerical reporting in the main narrative. The revised manuscript will expand the experimental section with complete tables that list per-benchmark accuracy for GoA (with standard deviations from at least three independent runs) alongside the corresponding numbers for all baselines, including the precise Mixture-of-Agents configuration (number of layers, agent count, and prompting template). We will also add a supplementary table or figure that directly contrasts the 3-agent GoA setting against the 6-agent baselines to make effect sizes transparent.

revision: yes

Referee: [§3.1] §3.1 (Node Sampling): The assumption that model cards accurately capture task-relevant specialization for effective sampling is stated but not tested (e.g., via an oracle or human-verified relevance ranking of the 6 models per benchmark). This choice directly affects which 3 agents are selected and therefore the reported gains.

Authors: We acknowledge that the sampling step rests on an untested assumption about model-card fidelity. In the revision we will add (i) an explicit listing of the three agents chosen for each benchmark together with the model-card excerpts that drove the selection, and (ii) a new baseline experiment that replaces model-card sampling with random selection of three agents from the same pool. The performance gap between these two selection strategies will be reported to quantify the value of the model-card approach. A complete oracle or exhaustive human ranking of all six models across every benchmark is resource-intensive and was not performed; we will therefore frame the model cards as a practical, reproducible proxy and discuss its limitations.

revision: partial
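The reporting promised in these responses (mean and standard deviation over at least three runs, plus the gap between two selection strategies) reduces to a small amount of arithmetic. The sketch below uses Python's `statistics` module; the method names and accuracy values are placeholders chosen for illustration and are not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List


def summarize(runs: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Mean and sample standard deviation of accuracy per method (needs >= 2 runs)."""
    return {method: {"mean": mean(accs), "std": stdev(accs)}
            for method, accs in runs.items()}


def gap(summary: Dict[str, Dict[str, float]], method_a: str, method_b: str) -> float:
    """Difference in mean accuracy between two methods."""
    return summary[method_a]["mean"] - summary[method_b]["mean"]


# Placeholder accuracies from three hypothetical runs; NOT taken from the paper.
runs = {"method_a": [0.70, 0.72, 0.74], "method_b": [0.60, 0.62, 0.64]}
summary = summarize(runs)
for method, stats in summary.items():
    print(f"{method}: {stats['mean']:.3f} +/- {stats['std']:.3f}")
print(f"gap: {gap(summary, 'method_a', 'method_b'):.3f}")
```

Reporting the gap next to the per-method standard deviations lets a reader judge whether the difference between model-card selection and random selection is larger than run-to-run noise.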

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivations or fitted predictions

full rationale

The paper presents an algorithmic framework (node sampling from model cards, response-based edge construction, directed message passing, reverse refinement, and graph pooling) evaluated empirically on benchmarks. No equations, first-principles derivations, or parameter-fitting steps are described that could reduce to self-definition or fitted inputs. Performance claims rest on experimental comparisons against baselines rather than any internal construction that forces the outcome. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The approach is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim relies on the assumption that model cards and response evaluation are effective proxies for relevance. No free parameters are explicitly fitted, though hyperparameters such as the number of selected agents (3) are chosen.

axioms (2)

domain assumption: Model cards provide accurate summaries of LLM capabilities for selection.
Used in the node sampling step.
domain assumption: Comparing responses can determine relevance ordering for edges.
Used in edge construction.

invented entities (1)

Graph-of-Agents framework · no independent evidence
purpose: To model multi-agent LLM communication via graphs.
New proposed structure, no independent evidence beyond the paper's experiments.

