GraphPlanner: Graph Memory-Augmented Agentic Routing for Multi-Agent LLMs
Pith reviewed 2026-05-08 06:15 UTC · model grok-4.3
The pith
GraphPlanner uses a heterogeneous graph memory and reinforcement learning to generate adaptive routing workflows for multi-agent LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GraphPlanner generates routing workflows for each query by formulating the process as a Markov Decision Process where at each step the system selects both an LLM backbone and an agent role from Planner, Executor, and Summarizer. The state representation is enriched with historical and workflow memories drawn from a heterogeneous graph GARNet that records interactions among queries, agents, and responses. The full pipeline is trained end-to-end with reinforcement learning to improve task-specific performance while lowering computational cost, supporting both inductive and transductive inference on unseen tasks and models.
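The step structure described above can be made concrete with a toy rollout. This is a minimal sketch under stated assumptions: the backbone names are invented, and we assume a Summarizer action terminates the workflow. The paper's actual policy is learned with RL and conditions on graph-memory-augmented state features rather than choosing randomly.

```python
import random
from dataclasses import dataclass, field

# Illustrative action spaces. The backbone names are assumptions;
# the three roles come from the paper (Planner, Executor, Summarizer).
BACKBONES = ["llama-3-8b", "qwen-2-7b", "mistral-7b"]
ROLES = ["Planner", "Executor", "Summarizer"]

@dataclass
class State:
    query: str
    workflow: list = field(default_factory=list)  # (backbone, role) steps so far

def select_action(state: State, rng: random.Random):
    """Stand-in for the learned policy: GraphPlanner would score
    (backbone, role) pairs from memory-augmented state features."""
    return rng.choice(BACKBONES), rng.choice(ROLES)

def rollout(query: str, max_steps: int = 4, seed: int = 0):
    """One MDP episode: at each step pick a joint (backbone, role)
    action; we assume a Summarizer step ends the workflow."""
    rng = random.Random(seed)
    state = State(query)
    for _ in range(max_steps):
        backbone, role = select_action(state, rng)
        state.workflow.append((backbone, role))
        if role == "Summarizer":
            break
    return state.workflow
```

A reward combining task accuracy and compute cost would then score the finished workflow for the RL update.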
What carries the argument
The GARNet heterogeneous graph that captures interaction memories among queries, agents, and responses, which augments the state representation in the MDP formulation for joint RL optimization of routing decisions.
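A typed adjacency structure is enough to illustrate what such a graph memory records. The node and edge schema below is an assumption (the paper describes GARNet only at a high level); a minimal sketch:

```python
from collections import defaultdict

class HeteroGraphMemory:
    """Minimal sketch of a GARNet-style heterogeneous interaction
    memory: typed nodes (query, agent, response) connected by typed
    edges. The schema here is illustrative, not the paper's."""
    def __init__(self):
        self.nodes = {}                 # node_id -> (node_type, payload)
        self.edges = defaultdict(list)  # node_id -> [(edge_type, dst_id)]

    def add_node(self, nid, ntype, payload):
        self.nodes[nid] = (ntype, payload)

    def add_edge(self, src, etype, dst):
        self.edges[src].append((etype, dst))

    def neighbors(self, nid, etype=None):
        """Destination ids reachable from nid, optionally by edge type."""
        return [d for (t, d) in self.edges[nid] if etype is None or t == etype]

# Record one interaction: a query routed to an agent that produced a response.
g = HeteroGraphMemory()
g.add_node("q1", "query", "Solve 2+2")
g.add_node("a1", "agent", ("llama-3-8b", "Executor"))
g.add_node("r1", "response", "4")
g.add_edge("q1", "routed_to", "a1")
g.add_edge("a1", "produced", "r1")
```

State augmentation would then pool features from a query node's neighborhood into the MDP state.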
If this is right
- GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% across 14 diverse LLM tasks.
- It reduces GPU cost from 186.26 GiB to 1.04 GiB while maintaining or increasing performance.
- It generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities.
- It effectively leverages historical memories to support both inductive and transductive inference for more adaptive routing.
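The inductive/transductive distinction in the last bullet can be sketched as two lookup modes over the same memory. The names and the cosine-similarity rule are assumptions; the paper only states that both modes are supported.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

# Toy memory: query_id -> (embedding, stored workflow).
MEMORY = {
    "q1": ([1.0, 0.0], [("llama-3-8b", "Executor")]),
    "q2": ([0.0, 1.0], [("qwen-2-7b", "Planner"), ("qwen-2-7b", "Summarizer")]),
}

def recall(query_id, query_emb):
    """Transductive: the query node already exists in memory, so its
    stored workflow is reused directly. Inductive: a new query is
    embedded and matched to the most similar stored query."""
    if query_id in MEMORY:  # transductive case
        return MEMORY[query_id][1]
    best = max(MEMORY.values(), key=lambda rec: cosine(rec[0], query_emb))
    return best[1]          # inductive case
```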
Where Pith is reading between the lines
- The MDP-plus-graph formulation could be applied to other sequential routing or planning problems where past interaction data improves future decisions.
- If the graph memory continues to scale, it may allow routers to adapt to new LLMs without full retraining by simply extending the stored interaction records.
- Removing the graph component in an ablation would likely eliminate the reported gains in generalization and cost reduction.
Load-bearing premise
That formulating workflow generation as an MDP and optimizing it with RL on the GARNet graph memory will reliably produce superior routing decisions without the learned policy overfitting to the training tasks or the graph failing to capture useful long-term interaction patterns.
What would settle it
An evaluation on a new set of tasks or LLMs where GraphPlanner shows no accuracy improvement over strong single-round and multi-round baselines, or where the claimed GPU cost reduction from 186.26 GiB to 1.04 GiB does not occur.
Original abstract
LLM routing has achieved promising results in integrating the strengths of diverse models while balancing efficiency and performance. However, to support more realistic and challenging applications, routing must extend into agentic LLM settings, where task planning, multi-round cooperation among heterogeneous agents, and memory utilization are indispensable. To address this gap, we propose GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs that generates routing workflows for each query and supports both inductive and transductive inference. GraphPlanner formulates workflow generation as a Markov Decision Process (MDP), where at each step it selects both the LLM backbone and the agent role, including Planner, Executor, and Summarizer. By leveraging a heterogeneous graph, denoted as GARNet, to capture interaction memories among queries, agents, and responses, GraphPlanner integrates historical memory and workflow memory into richer state representations. The entire pipeline is optimized with reinforcement learning, jointly improving task-specific performance and computational efficiency. We evaluate GraphPlanner across 14 diverse LLM tasks and demonstrate that: (1) GraphPlanner outperforms strong single-round and multi-round routers, improving accuracy by up to 9.3% while reducing GPU cost from 186.26 GiB to 1.04 GiB; (2) GraphPlanner generalizes robustly to unseen tasks and LLMs, exhibiting strong zero-shot capabilities; and (3) GraphPlanner effectively leverages historical memories, supporting both inductive and transductive inference for more adaptive routing. Our code for GraphPlanner is released at https://github.com/ulab-uiuc/GraphPlanner.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GraphPlanner, a heterogeneous graph memory-augmented agentic router for multi-agent LLMs. It formulates workflow generation as an MDP in which the policy at each step selects both an LLM backbone and an agent role (Planner, Executor, or Summarizer). A heterogeneous graph GARNet encodes interaction memories among queries, agents, and responses and augments the state representation; the full pipeline is trained with reinforcement learning to jointly optimize task performance and efficiency. Experiments across 14 LLM tasks are reported to demonstrate up to 9.3% accuracy gains over strong single- and multi-round routers, a reduction in GPU cost from 186.26 GiB to 1.04 GiB, robust zero-shot generalization to unseen tasks and LLMs, and effective use of historical memories for both inductive and transductive inference. Code is released.
Significance. If the empirical results hold after addressing the evidentiary gaps, the work would constitute a useful extension of LLM routing into realistic agentic, multi-round settings by combining graph memory with RL-based workflow planning. The public code release is a clear strength that supports reproducibility and further research.
major comments (3)
- [Experiments section] The central performance claims (up to 9.3% accuracy improvement and a GPU cost reduction from 186.26 GiB to 1.04 GiB) are presented without error bars, detailed descriptions of the 14 tasks, baseline implementation details, or ablation studies that isolate the contribution of the GARNet graph memory from the MDP/RL formulation alone. This omission makes it impossible to verify whether the graph memory is load-bearing for the reported gains.
- [Experiments / Generalization subsection] Generalization claims (zero-shot to unseen tasks/LLMs, inductive/transductive inference): No policy analysis, overfitting diagnostics, regularization details, or comparisons against non-graph RL baselines are provided. RL policies on graph-structured states are known to overfit training query distributions; without such evidence the generalization results cannot be taken as confirmed.
- [Method section] The precise update and retrieval mechanisms by which GARNet incorporates historical memories into the MDP state representation are described only at a high level. This leaves open whether the memory augmentation introduces circularity or simply memorizes training patterns rather than enabling robust inference.
minor comments (2)
- [Abstract] The abstract leaves the acronyms LLM and GARNet unexpanded on first use; brief parenthetical definitions would improve readability.
- [Method section] Notation for the heterogeneous graph and its node/edge types is introduced without a compact mathematical definition or diagram reference, complicating replication of the state representation.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the manuscript would benefit from greater experimental rigor and methodological precision to better substantiate the claims. We address each major comment below and will incorporate the suggested revisions in the next version of the manuscript.
Point-by-point responses
- Referee: [Experiments section] The central performance claims (up to 9.3% accuracy improvement and GPU cost reduction from 186.26 GiB to 1.04 GiB) are presented without error bars, detailed descriptions of the 14 tasks, baseline implementations, or ablation studies that isolate the contribution of the GARNet graph memory versus the MDP/RL formulation alone. This omission makes it impossible to verify whether the graph memory is load-bearing for the reported gains.
Authors: We agree that the current presentation lacks sufficient statistical rigor and component isolation. In the revised manuscript we will add error bars computed over at least five random seeds for all accuracy and cost metrics. Detailed descriptions of the 14 tasks will be provided in an appendix. Baseline implementations will be clarified with exact hyper-parameters and references to the released code. We will also include new ablation studies comparing the full model against (i) an MDP/RL variant without GARNet and (ii) a non-RL variant with fixed workflows, thereby isolating the contribution of the heterogeneous graph memory. revision: yes
- Referee: [Experiments / Generalization subsection] Generalization claims (zero-shot to unseen tasks/LLMs, inductive/transductive inference): No policy analysis, overfitting diagnostics, regularization details, or comparisons against non-graph RL baselines are provided. RL policies on graph-structured states are known to overfit training query distributions; without such evidence the generalization results cannot be taken as confirmed.
Authors: We acknowledge the risk of overfitting in graph-augmented RL policies. The revised version will include (i) policy visualizations contrasting behavior on training versus held-out queries, (ii) training/validation performance curves and explicit regularization details (dropout in GARNet layers and entropy regularization in the RL objective), and (iii) direct comparisons against non-graph RL baselines. These additions will provide the requested diagnostics and strengthen the zero-shot, inductive, and transductive generalization claims. revision: yes
- Referee: [Method section] The precise update and retrieval mechanisms by which GARNet incorporates historical memories into the MDP state representation are described only at a high level. This leaves open whether the memory augmentation introduces circularity or simply memorizes training patterns rather than enabling robust inference.
Authors: We will expand the Method section with formal definitions of the update and retrieval operations, including pseudocode for memory insertion and similarity-based retrieval at each MDP step. We will explicitly clarify that memory updates occur after inference and that retrieval uses embedding similarity rather than exact pattern matching, thereby avoiding circularity. Additional discussion will explain how the joint RL objective favors generalizable routing policies over memorization, supported by the new generalization diagnostics. revision: yes
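The retrieve-then-insert discipline promised in the last response can be sketched as follows. All names are illustrative, and a plain dot product stands in for whatever similarity measure the authors use; the point is the ordering: memory is read before inference and written back only afterwards, so a query can never retrieve its own record.

```python
class MemoryStore:
    """Sketch of a non-circular memory loop: retrieve past episodes,
    run inference, then insert the finished episode. Illustrative only;
    not the paper's published implementation."""
    def __init__(self):
        self.records = []  # (query_embedding, workflow, reward) tuples

    def retrieve(self, query_emb, k=2):
        """Top-k past records by embedding similarity (dot product
        as a stand-in similarity score)."""
        scored = sorted(
            self.records,
            key=lambda r: -sum(a * b for a, b in zip(r[0], query_emb)),
        )
        return scored[:k]

    def run_episode(self, query_emb, policy):
        context = self.retrieve(query_emb)                  # 1. read memory
        workflow, reward = policy(query_emb, context)       # 2. infer
        self.records.append((query_emb, workflow, reward))  # 3. write back
        return workflow
```

Because step 3 happens after step 2, the current query's record is absent from its own retrieval context, which is the circularity-avoidance property the authors claim.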
Circularity Check
No circularity: standard MDP+RL proposal with empirical results on held-out tasks
Full rationale
The paper defines GraphPlanner by formulating workflow generation as an MDP (selecting LLM backbone and role at each step) and optimizing the policy with RL on the GARNet heterogeneous graph that encodes query-agent-response memories. All reported gains (accuracy, GPU reduction, zero-shot generalization) are presented as outcomes of training and evaluation on 14 tasks, not as quantities that reduce by construction to the fitted parameters or to a self-referential definition. No equations equate a claimed prediction to an input fit, no uniqueness theorem is imported from the same authors, and no ansatz is smuggled via self-citation. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: workflow generation for multi-agent LLMs can be modeled as a Markov Decision Process in which each step selects an LLM backbone and an agent role.
invented entities (1)
- GARNet (no independent evidence)