GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs
Pith reviewed 2026-05-08 06:15 UTC · model grok-4.3
The pith
GraphPlanner uses a heterogeneous graph memory and reinforcement learning to generate adaptive routing workflows for multi-agent LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GraphPlanner generates routing workflows for each query by formulating the process as a Markov Decision Process where at each step the system selects both an LLM backbone and an agent role from Planner, Executor, and Summarizer. The state representation is enriched with historical and workflow memories drawn from a heterogeneous graph GARNet that records interactions among queries, agents, and responses. The full pipeline is trained end-to-end with reinforcement learning to improve task-specific performance while lowering computational cost, supporting both inductive and transductive inference on unseen tasks and models.
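The step structure described above can be made concrete with a toy rollout. This is a minimal sketch under stated assumptions: the backbone names are invented, and we assume a Summarizer action terminates the workflow. The paper's actual policy is learned with RL and conditions on graph-memory-augmented state features rather than choosing randomly.

```python
import random
from dataclasses import dataclass, field

# Illustrative action spaces. The backbone names are assumptions;
# the three roles come from the paper (Planner, Executor, Summarizer).
BACKBONES = ["llama-3-8b", "qwen-2-7b", "mistral-7b"]
ROLES = ["Planner", "Executor", "Summarizer"]

@dataclass
class State:
    query: str
    workflow: list = field(default_factory=list)  # (backbone, role) steps so far

def select_action(state: State, rng: random.Random):
    """Stand-in for the learned policy: GraphPlanner would score
    (backbone, role) pairs from memory-augmented state features."""
    return rng.choice(BACKBONES), rng.choice(ROLES)

def rollout(query: str, max_steps: int = 4, seed: int = 0):
    """One MDP episode: at each step pick a joint (backbone, role)
    action; we assume a Summarizer step ends the workflow."""
    rng = random.Random(seed)
    state = State(query)
    for _ in range(max_steps):
        backbone, role = select_action(state, rng)
        state.workflow.append((backbone, role))
        if role == "Summarizer":
            break
    return state.workflow
```

A reward combining task accuracy and compute cost would then score the finished workflow for the RL update.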
What carries the argument
The GARNet heterogeneous graph that captures interaction memories among queries, agents, and responses, which augments the state representation in the MDP formulation for joint RL optimization of routing decisions.
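A typed adjacency structure is enough to illustrate what such a graph memory records. The node and edge schema below is an assumption (the paper describes GARNet only at a high level); a minimal sketch:

```python
from collections import defaultdict

class HeteroGraphMemory:
    """Minimal sketch of a GARNet-style heterogeneous interaction
    memory: typed nodes (query, agent, response) connected by typed
    edges. The schema here is illustrative, not the paper's."""
    def __init__(self):
        self.nodes = {}                 # node_id -> (node_type, payload)
        self.edges = defaultdict(list)  # node_id -> [(edge_type, dst_id)]

    def add_node(self, nid, ntype, payload):
        self.nodes[nid] = (ntype, payload)

    def add_edge(self, src, etype, dst):
        self.edges[src].append((etype, dst))

    def neighbors(self, nid, etype=None):
        """Destination ids reachable from nid, optionally by edge type."""
        return [d for (t, d) in self.edges[nid] if etype is None or t == etype]

# Record one interaction: a query routed to an agent that produced a response.
g = HeteroGraphMemory()
g.add_node("q1", "query", "Solve 2+2")
g.add_node("a1", "agent", ("llama-3-8b", "Executor"))
g.add_node("r1", "response", "4")
g.add_edge("q1", "routed_to", "a1")
g.add_edge("a1", "produced", "r1")
```

State augmentation would then pool features from a query node's neighborhood into the MDP state.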
If this is right
- GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% across 14 diverse LLM tasks.
- It reduces GPU cost from 186.26 GiB to 1.04 GiB while maintaining or increasing performance.
- It generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities.
- It effectively leverages historical memories to support both inductive and transductive inference for more adaptive routing.
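The inductive/transductive distinction in the last bullet can be sketched as two lookup modes over the same memory. The names and the cosine-similarity rule are assumptions; the paper only states that both modes are supported.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

# Toy memory: query_id -> (embedding, stored workflow).
MEMORY = {
    "q1": ([1.0, 0.0], [("llama-3-8b", "Executor")]),
    "q2": ([0.0, 1.0], [("qwen-2-7b", "Planner"), ("qwen-2-7b", "Summarizer")]),
}

def recall(query_id, query_emb):
    """Transductive: the query node already exists in memory, so its
    stored workflow is reused directly. Inductive: a new query is
    embedded and matched to the most similar stored query."""
    if query_id in MEMORY:  # transductive case
        return MEMORY[query_id][1]
    best = max(MEMORY.values(), key=lambda rec: cosine(rec[0], query_emb))
    return best[1]          # inductive case
```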
Where Pith is reading between the lines
- The MDP-plus-graph formulation could be applied to other sequential routing or planning problems where past interaction data improves future decisions.
- If the graph memory continues to scale, it may allow routers to adapt to new LLMs without full retraining by simply extending the stored interaction records.
- Removing the graph component in an ablation would likely eliminate the reported gains in generalization and cost reduction.
Load-bearing premise
That formulating workflow generation as an MDP and optimizing it with RL on the GARNet graph memory will reliably produce superior routing decisions without the learned policy overfitting to the training tasks or the graph failing to capture useful long-term interaction patterns.
What would settle it
An evaluation on a new set of tasks or LLMs where GraphPlanner shows no accuracy improvement over strong single-round and multi-round baselines, or where the claimed GPU cost reduction from 186.26 GiB to 1.04 GiB does not occur.
Original abstract
LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab-uiuc/GraphPlanner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs. It formulates workflow generation as an MDP in which the policy at each step selects both an LLM backbone and an agent role (Planner, Executor, or Summarizer). A heterogeneous graph GARNet encodes interaction memories among queries, agents, and responses and augments the state representation; the full pipeline is trained with reinforcement learning to jointly optimize task performance and efficiency. Experiments across 14 LLM tasks are reported to demonstrate up to 9.3% accuracy gains over strong single- and multi-round routers, a reduction in GPU cost from 186.26 GiB to 1.04 GiB, robust zero-shot generalization to unseen tasks and LLMs, and effective use of historical memories for both inductive and transductive inference. Code is released.
Significance. If the empirical results hold after addressing the evidentiary gaps, the work would constitute a useful extension of LLM routing into realistic agentic, multi-round settings by combining graph memory with RL-based workflow planning. The public code release is a clear strength that supports reproducibility and further research.
major comments (3)
- [Experiments section] The central performance claims (up to 9.3% accuracy improvement and a GPU cost reduction from 186.26 GiB to 1.04 GiB) are presented without error bars, detailed descriptions of the 14 tasks, baseline implementation details, or ablation studies that isolate the contribution of the GARNet graph memory from the MDP/RL formulation alone. This omission makes it impossible to verify whether the graph memory is load-bearing for the reported gains.
- [Experiments / Generalization subsection] Generalization claims (zero-shot to unseen tasks/LLMs, inductive/transductive inference): No policy analysis, overfitting diagnostics, regularization details, or comparisons against non-graph RL baselines are provided. RL policies on graph-structured states are known to overfit training query distributions; without such evidence the generalization results cannot be taken as confirmed.
- [Method section] The precise update and retrieval mechanisms by which GARNet incorporates historical memories into the MDP state representation are described only at a high level. This leaves open whether the memory augmentation introduces circularity or simply memorizes training patterns rather than enabling robust inference.
minor comments (2)
- [Abstract] The abstract leaves the acronyms LLM and GARNet unexpanded on first use; brief parenthetical definitions would improve readability.
- [Method section] Notation for the heterogeneous graph and its node/edge types is introduced without a compact mathematical definition or diagram reference, complicating replication of the state representation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the manuscript would benefit from greater experimental rigor and methodological precision to better substantiate the claims. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.
Point-by-point responses
- Referee: [Experiments section] The central performance claims (up to 9.3% accuracy improvement and GPU cost reduction from 186.26 GiB to 1.04 GiB) are presented without error bars, detailed descriptions of the 14 tasks, baseline implementations, or ablation studies that isolate the contribution of the GARNet graph memory versus the MDP/RL formulation alone. This omission makes it impossible to verify whether the graph memory is load-bearing for the reported gains.
Authors: We agree that the current presentation lacks sufficient statistical rigor and component isolation. In the revised manuscript we will add error bars computed over at least five random seeds for all accuracy and cost metrics. Detailed descriptions of the 14 tasks will be provided in an appendix. Baseline implementations will be clarified with exact hyper-parameters and references to the released code. We will also include new ablation studies comparing the full model against (i) an MDP/RL variant without GARNet and (ii) a non-RL variant with fixed workflows, thereby isolating the contribution of the heterogeneous graph memory. revision: yes
- Referee: [Experiments / Generalization subsection] Generalization claims (zero-shot to unseen tasks/LLMs, inductive/transductive inference): No policy analysis, overfitting diagnostics, regularization details, or comparisons against non-graph RL baselines are provided. RL policies on graph-structured states are known to overfit training query distributions; without such evidence the generalization results cannot be taken as confirmed.
Authors: We acknowledge the risk of overfitting in graph-augmented RL policies. The revised version will include (i) policy visualizations contrasting behavior on training versus held-out queries, (ii) training/validation performance curves and explicit regularization details (dropout in GARNet layers and entropy regularization in the RL objective), and (iii) direct comparisons against non-graph RL baselines. These additions will provide the requested diagnostics and strengthen the zero-shot, inductive, and transductive generalization claims. revision: yes
- Referee: [Method section] The precise update and retrieval mechanisms by which GARNet incorporates historical memories into the MDP state representation are described only at a high level. This leaves open whether the memory augmentation introduces circularity or simply memorizes training patterns rather than enabling robust inference.
Authors: We will expand the Method section with formal definitions of the update and retrieval operations, including pseudocode for memory insertion and similarity-based retrieval at each MDP step. We will explicitly clarify that memory updates occur after inference and that retrieval uses embedding similarity rather than exact pattern matching, thereby avoiding circularity. Additional discussion will explain how the joint RL objective favors generalizable routing policies over memorization, supported by the new generalization diagnostics. revision: yes
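The retrieve-then-insert discipline promised in the last response can be sketched as follows. All names are illustrative, and a plain dot product stands in for whatever similarity measure the authors use; the point is the ordering: memory is read before inference and written back only afterwards, so a query can never retrieve its own record.

```python
class MemoryStore:
    """Sketch of a non-circular memory loop: retrieve past episodes,
    run inference, then insert the finished episode. Illustrative only;
    not the paper's published implementation."""
    def __init__(self):
        self.records = []  # (query_embedding, workflow, reward) tuples

    def retrieve(self, query_emb, k=2):
        """Top-k past records by embedding similarity (dot product
        as a stand-in similarity score)."""
        scored = sorted(
            self.records,
            key=lambda r: -sum(a * b for a, b in zip(r[0], query_emb)),
        )
        return scored[:k]

    def run_episode(self, query_emb, policy):
        context = self.retrieve(query_emb)                  # 1. read memory
        workflow, reward = policy(query_emb, context)       # 2. infer
        self.records.append((query_emb, workflow, reward))  # 3. write back
        return workflow
```

Because step 3 happens after step 2, the current query's record is absent from its own retrieval context, which is the circularity-avoidance property the authors claim.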
Circularity Check
No circularity: standard MDP+RL proposal with empirical results on held-out tasks
Full rationale
The paper defines GraphPlanner by formulating workflow generation as an MDP (selecting LLM backbone and role at each step) and optimizing the policy with RL on the GARNet heterogeneous graph that encodes query-agent-response memories. All reported gains (accuracy, GPU reduction, zero-shot generalization) are presented as outcomes of training and evaluation on 14 tasks, not as quantities that reduce by construction to the fitted parameters or to a self-referential definition. No equations equate a claimed prediction to an input fit, no uniqueness theorem is imported from the same authors, and no ansatz is smuggled via self-citation. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: workflow generation for multi-agent LLMs can be modeled as a Markov Decision Process in which each step selects an LLM backbone and an agent role.
invented entities (1)
- GARNet (no independent evidence)