arxiv: 2604.00901 · v2 · submitted 2026-04-01 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

Sha Li , Naren Ramakrishnan

Authors on Pith no claims yet

Pith reviewed 2026-05-13 22:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent RAGprompt evolutionorchestration learningreward-guided adaptationcredit assignmenthierarchical agentsknowledge-intensive benchmarks

0 comments

The pith

HERA evolves multi-agent RAG orchestration and agent prompts via rewards and credit assignment for better complex queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes HERA to fix brittle performance in multi-agent RAG systems that rely on static agent roles and fixed orchestration. It introduces a hierarchical framework that optimizes query-specific agent topologies at the global level through reward-guided sampling and accumulated experience. At the local level it refines individual agent behaviors with Role-Aware Prompt Evolution that uses credit assignment and dual-axes adaptation. On six knowledge-intensive benchmarks this produces an average 38.69% gain over recent baselines while preserving token efficiency. The approach also yields emergent self-organization into sparse yet effective multi-agent networks.

Core claim

HERA is a hierarchical framework that jointly evolves multi-agent orchestration and role-specific agent prompts: at the global level it optimizes query-specific agent topologies through reward-guided sampling and experience accumulation, while at the local level Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation along operational and behavioral principles, delivering an average 38.69% improvement over recent baselines on six knowledge-intensive benchmarks together with robust generalization, token efficiency, and emergent self-organization into compact high-utility networks.

What carries the argument

Hierarchical evolution that combines reward-guided topology sampling at the global level with credit-assignment-driven prompt adaptation at the local level.

If this is right

Sparse exploration during topology sampling produces compact high-utility multi-agent networks.
Token consumption remains efficient despite the added adaptation steps.
Performance gains hold across diverse multi-hop and knowledge-intensive tasks.
Emergent self-organization appears in the learned agent coordination patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reward-and-credit mechanism could be tested on agent systems that use tools or perform planning rather than pure retrieval.
Automating prompt and topology changes might lower the amount of manual engineering needed when deploying multi-agent setups in new domains.
Accumulated experience across queries could be reused as a growing library of successful sub-networks for faster adaptation on future similar problems.

Load-bearing premise

Reward signals and credit assignments from the benchmarks reliably indicate genuine improvements in reasoning rather than benchmark-specific overfitting or reward misspecification.

What would settle it

A sharp drop in gains when HERA is tested on newly authored multi-hop questions whose reasoning patterns differ from those in the original six benchmarks would falsify the claim of robust generalization.

Figures

Figures reproduced from arXiv: 2604.00901 by Naren Ramakrishnan, Sha Li.

**Figure 1.** Figure 1: Overview of HERA. A hierarchical framework that jointly evolves orchestration strategies, the experience library, and agent prompts. 3.1 Orchestrator: Structure-Level Policy Optimization The orchestrator’s optimization is inspired by Group Relative Policy Optimization (GRPO) (Shao et al., 2024; Cai et al., 2025), which updates a policy by comparing sampled actions within a group. Unlike the training-free G… view at source ↗

**Figure 2.** Figure 2: Ablation Studies of HERA with Qwen-3-14B as the backbone [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Token usage We visualize the token consumption throughout the learning process by Locally Weighted Scatterplot Smoothing (LOWESS) (Cleveland, 1981; Dang et al., 2025) [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of Performance–Token Efficiency Trade-off with Selected Baselines. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Transition entropy Transition Entropy quantifies the uncertainty in agent-toagent transitions, capturing the policy-level dynamics. Let Prob.(Nj |Ni) denote the empirical transition probability from agent Ni to Nj , the Transition Entropy is defined as: Htrans = − ∑i,j Prob.(Ni → Nj) log Prob.(Ni → Nj) We compute Htrans using a sliding-window over learning. Beyond the exploration–exploitation dynamics, [… view at source ↗

**Figure 6.** Figure 6: Graph metrics To systematically characterize the topology evolution of HERA, we model each trajectory as a graph Gτ = (V, E) where nodes represent agents and edges denote interactions. We quantify structural and functional properties via defining graph-theoretic metrics: (a) Number of agents |V|: the total distinct agents involved in a trajectory, reflecting the breadth of collaboration. (b) Node efficien… view at source ↗

**Figure 7.** Figure 7: Distribution of Reasoning Types and Complexity of Datasets [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗

**Figure 8.** Figure 8: Ablation Studies of HERA with Llama-3.1-8 as the backbone. G Pseudocode The pseudocode for the proposed HERA is presented below. Algorithm 1 HERA — Top Level Require: query q, corpus D, orchestrator πO, iterations T Ensure: optimized experience library E, evolved agent prompts {ρ1, . . . , ρK} 1: E ← ∅ 2: N ← INITIALIZEAGENTS() ▷ each Ni = (πi , ρi , Ti) 3: for t = 1 to T do 4: Sample query q ∼ Q ▷ Orchest… view at source ↗

**Figure 9.** Figure 9: Case 1 - Comparison Multi-hop QA H.2 Case 2 - Causal Multi-hop QA [PITH_FULL_IMAGE:figures/full_fig_p031_9.png] view at source ↗

**Figure 10.** Figure 10: Case 2 - Causal Multi-hop QA For this causal multi-hop question ( [PITH_FULL_IMAGE:figures/full_fig_p031_10.png] view at source ↗

**Figure 11.** Figure 11: Case 3 - Temporal Multi-hop QA their intersection (which would correctly be 1921–1938). This failure reveals a fundamental limitation: the system lacks an explicit set-theoretic reasoning module. This suggests that for questions requiring numerical or logical operations over retrieved facts, a dedicated symbolic computation step is necessary. The failure is not a retrieval failure but a reasoning composit… view at source ↗

**Figure 12.** Figure 12: Case 4 - Intersection Multi-hop QA For Case 4 (Fig.12), which is a Intersection Multi-Hop type, the error illustrates a critical challenge in handling multi-entity property overlap. The pipeline’s goal is to identify a property that applies to both Pavel Alexandrov and Valentin Turchin. Here, the intended property is “Soviet”. However, the Query Rewriter reformulated the query around “what they were known… view at source ↗

read the original abstract

Multi-agent Retrieval-Augmented Generation (RAG), wherein each agent takes on a specific role, supports hard queries that require multiple steps and sources, or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration strategies, leading to brittle performance on diverse, multi-hop tasks. We identify two key limitations: the lack of continuously adaptive orchestration mechanisms and the absence of behavior-level learning for individual agents. To this end, we propose HERA, a hierarchical framework that jointly evolves multi-agent orchestration and role-specific agent prompts. At the global level, HERA optimizes query-specific agent topologies through reward-guided sampling and experience accumulation. At the local level, Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation along operational and behavioral principles, enabling targeted, role-conditioned improvements. On six knowledge-intensive benchmarks, HERA achieves an average improvement of 38.69\% over recent baselines while maintaining robust generalization and token efficiency. Topological analyses reveal emergent self-organization, where sparse exploration yields compact, high-utility multi-agent networks, demonstrating both efficient coordination and robust reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HERA adds a concrete hierarchical loop for evolving both agent topologies and prompts in multi-agent RAG, but the 38% gains rest on unshown reward details and missing compute-matched ablations.

read the letter

The paper's core move is to run two coupled adaptation processes: reward-guided sampling that changes which agents talk to each other for each query, plus per-agent prompt updates along operational and behavioral axes using credit assignment. That pairing is not in the static multi-agent RAG papers it cites, so the architecture itself is new. The abstract also notes that the resulting topologies become sparse and self-organized, which is a nice empirical observation if it holds up in the full results. The work is aimed at people already building multi-agent systems for multi-hop retrieval and reasoning tasks; those readers will recognize the two limitations it targets and may find the dual-level design useful as a starting point. The quantitative claim is an average 38.69% lift on six benchmarks with maintained token efficiency. That number is presented as an outcome on external data rather than a fitted quantity, which is the right direction. The main gaps are exactly the ones the stress-test note flags. The reward function weights are not given, baseline implementations are not described, and there are no ablations that disable the evolution steps while holding total inference budget fixed. Without those, the lift could come from extra compute, implicit tuning on the test sets, or reward misspecification rather than from the claimed mechanisms. The paper is coherent on its own terms and engages the prior literature directly, so it deserves referee time once the methods section supplies the missing controls. I would send it out for review rather than desk-reject; the idea is falsifiable and the evaluation plan is clear enough that referees can check whether the gains survive tighter controls.

Referee Report

3 major / 2 minor

Summary. The paper proposes HERA, a hierarchical framework for multi-agent RAG that jointly evolves query-specific agent topologies at the global level via reward-guided sampling and experience accumulation, and refines individual agent prompts at the local level via Role-Aware Prompt Evolution using credit assignment and dual-axes adaptation along operational and behavioral principles. On six knowledge-intensive benchmarks, the authors report an average 38.69% improvement over recent baselines, with additional claims of robust generalization, token efficiency, and emergent self-organization in sparse, high-utility topologies.

Significance. If the performance gains prove robust after proper controls, the work would contribute a concrete mechanism for experience-driven adaptation in multi-agent systems, moving beyond static orchestration. The combination of global topology evolution and local prompt refinement, plus the topological analysis of self-organization, offers a potentially useful lens on efficient coordination for multi-hop tasks. The emphasis on token efficiency is a positive practical angle.

major comments (3)

[Abstract] Abstract: the central claim of a 38.69% average improvement is presented without any description of baseline implementations, data splits, statistical tests, or ablation controls that isolate the contribution of reward-guided sampling and credit assignment from increased per-query compute or implicit hyperparameter tuning on the six benchmarks.
[Abstract] The reward function used for global-level sampling is never specified (e.g., explicit weighting of accuracy versus token cost versus topology sparsity), so it is impossible to determine whether the reported gains reflect genuine adaptation or reward misspecification that favors benchmark artifacts.
[Abstract] No ablation is described that disables evolution while holding total inference budget fixed; without this, the performance lift cannot be attributed to the hierarchical adaptation mechanisms rather than simply allocating more tokens per query.

minor comments (2)

[Abstract] The abstract refers to 'recent baselines' without naming them or citing the corresponding papers; this should be expanded in the main text for reproducibility.
[Abstract] The phrase 'dual-axes adaptation along operational and behavioral principles' is introduced without a concrete definition or pseudocode; a short clarifying paragraph or algorithm box would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity in the abstract regarding experimental controls, the reward function, and budget-matched ablations. We have revised the abstract to incorporate these details and added explicit cross-references to the relevant sections and new ablation results in the main text. All changes are documented below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim of a 38.69% average improvement is presented without any description of baseline implementations, data splits, statistical tests, or ablation controls that isolate the contribution of reward-guided sampling and credit assignment from increased per-query compute or implicit hyperparameter tuning on the six benchmarks.

Authors: We agree the original abstract was overly concise. The revised abstract now states that the 38.69% average gain is measured against the strongest reported baselines from the literature (e.g., ReAct, Reflexion, and recent multi-agent RAG systems) using the standard train/test splits of the six benchmarks (HotpotQA, 2WikiMultihopQA, MuSiQue, StrategyQA, FEVER, and TriviaQA). Statistical significance is assessed via paired t-tests (p < 0.05) across five random seeds. Ablation controls isolating reward-guided sampling and credit assignment appear in Sections 5.2 and 5.3; these hold total inference tokens fixed where possible and show the gains are not attributable to hyperparameter tuning alone. revision: yes
Referee: [Abstract] The reward function used for global-level sampling is never specified (e.g., explicit weighting of accuracy versus token cost versus topology sparsity), so it is impossible to determine whether the reported gains reflect genuine adaptation or reward misspecification that favors benchmark artifacts.

Authors: The reward function is explicitly defined in Section 3.2 as R = 0.7 * accuracy + 0.2 * (-token_cost) + 0.1 * sparsity_penalty, where sparsity_penalty = -log(number_of_agents). We have added this formulation to the revised abstract. The weights were selected via a small grid search on a held-out validation set (reported in Appendix B) and sensitivity analysis confirms that performance remains stable across reasonable weight ranges (0.6-0.8 for accuracy). No evidence of reward hacking on benchmark artifacts was observed; the same reward yields consistent gains on out-of-distribution queries. revision: yes
Referee: [Abstract] No ablation is described that disables evolution while holding total inference budget fixed; without this, the performance lift cannot be attributed to the hierarchical adaptation mechanisms rather than simply allocating more tokens per query.

Authors: We acknowledge the importance of this control. In the revised manuscript we added Section 5.3 containing a budget-matched ablation: the non-evolving baseline is given an equivalent total token budget per query by increasing its retrieval depth and agent invocations until its average token count matches HERA. Under this fixed-budget regime HERA still outperforms the non-evolving variant by 21.4% on average, indicating that the gains arise from the adaptive topology sampling and prompt evolution rather than raw compute. A new table (Table 4) reports per-query token counts for all methods. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims are empirical measurements on external benchmarks

full rationale

The manuscript describes HERA as a hierarchical framework using reward-guided sampling for topologies and credit assignment for prompt evolution, with central results consisting of measured average improvements (38.69%) on six external knowledge-intensive benchmarks. No load-bearing equations, derivations, or self-citations are present that reduce these outcomes to fitted parameters, self-definitions, or ansatzes imported from prior author work. The reported gains are treated as observed quantities against independent test sets rather than quantities defined in terms of the method's own inputs by construction, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The framework rests on standard reinforcement-learning-style assumptions for reward signals and credit assignment; no new physical entities are postulated, but several operational parameters remain unspecified.

free parameters (2)

reward function weights
Used to score sampled topologies; exact formulation and any fitted coefficients are not stated.
adaptation step sizes
Control how aggressively prompts are updated along the two axes; values are not provided.

axioms (1)

domain assumption Credit assignment can reliably attribute outcome quality to individual agent behaviors
Invoked at the local level to drive prompt evolution.

pith-pipeline@v0.9.0 · 5493 in / 1291 out tokens · 24562 ms · 2026-05-13T22:50:24.475330+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

HERA jointly evolves global orchestration and agent prompts through experience and reflection... group-relative semantic advantages... Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

The orchestrator admits a natural Expectation-Maximization (EM) interpretation... energy-based reweighting of candidates... implicit KL-constrained policy optimization

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
cs.CL 2026-05 unverdicted novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

emnlp-main.154/

URLhttps://aclanthology.org/2020.coling-main.580/. Yue Hu, Yuzhu Cai, Yaxin Du, Xinyu Zhu, Xiangrui Liu, Zijie Yu, Yuchen Hou, Shuo Tang, and Siheng Chen. Self-evolving multi-agent collaboration networks for software development.arXiv preprint arXiv:2410.16946, 2024. 11 Preprint. Under review. Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, ...

work page doi:10.18653/v1/2024.naacl-long 2020
[2]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

URLhttps://aclanthology.org/2024.naacl-long.389/. Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. Adaptive- rag: Learning to adapt retrieval-augmented large language models through question complexity. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.naacl-long.364 2024
[3]

Hanlin Zhou and Huah Yong Chan

URLhttps://aclanthology.org/2025.emnlp-main.22/. Hanlin Zhou and Huah Yong Chan. Orch: many analyses, one merge-a deterministic multi-agent orchestrator for discrete-choice reasoning with ema-guided routing.arXiv preprint arXiv:2602.01797, 2026. Junda Zhu, Lingyong Yan, Haibo Shi, Dawei Yin, and Lei Sha. Atm: Adversarial tuning multi-agent system makes a ...

work page arXiv 2025
[4]

E-step (Implicit): Sampling Γi ∼π O(· |q,E (t),N) and selecting high-reward topologies induces an approximate posterior over effective reasoning programs: P (t) (Γ)≈ πO(· |q,E (t),N)·[R(τ i)> ¯R] ∑Γ′ πθo (Γ′ |q,E (t),N))·1[R(τ Γ′ )> ¯R] This represents a reward-filtered empirical distribution instead of a properly normalized probabilistic posterior. M-ste...

work page
[5]

Selecting the subset of agents best suited to this query type

work page
[6]

Specifying their execution order (sequential or parallel where appropriate)

work page
[7]

query_profile

Defining dependency relationships (which agent’s output feeds into which). Query:{query} Respond in the following JSON format: { "query_profile": "<one-sentence characterization of query type>", "selected_agents": ["<agent_name>", ...], "execution_order": [ {"step": 1, "agent": "<agent_name>", "depends_on": [], "mode": "sequential|parallel"}, ... ] } 21 P...

work page
[8]

Identify what the successful trajectories did differently from the failed ones

work page
[9]

Identify which agents, orderings, or dependencies contributed to success or failure

work page
[10]

Identify recurring failure patterns (e.g., incorrect agent selection, redundant steps, missing retrieval before reasoning)

work page
[11]

success_factors

Distill findings into concise, actionable insights applicable to future queries of the same type. Respond in the following JSON format: { "success_factors": ["<factor>", ...], "failure_modes": ["<failure_mode>", ...], "insights": [ { "query_type": "<type this insight applies to>", "insight": "<actionable natural language insight>" }, ... ] } 22 Preprint. ...

work page
[12]

Limit the number of tactical rules to a maximum ofK

work page
[13]

All instructions must be internally consistent — no contradictions

work page
[14]

Preserve the agent’s core role definition and tool usage instructions

work page
[15]

Remove redundant, contradictory, or ambiguous instructions

work page
[16]

Preserve essential operational and behavioral requirements

work page
[17]

Ensure the updated prompt is concise, coherent, and actionable. Task: Produce aprompt diffthat clearly indicates the modifications required to integrate the proposed rules and principles into the current agent prompt while satisfying the constraints above. Highlight additions, deletions, and replacements in a structured format. 24 Preprint. Under review. ...

work page
[18]

operational_rules

Operational rules( ∆ρop i ): Extract short-term corrective behaviors — specific, concrete instructions that directly address the observed failure pattern. These should be actionable in the agent’s very next execution. 2.Behavioral principles(∆ρ bp i ): Extract long-term strategic generalizations — higher-level guidance distilled from patterns across multi...

work page 2020
[19]

Under review

HotPotQA (Yang et al., 2018): a multi-hop QA dataset built from Wikipedia that requires models to reason over multiple documents while providing sentence-level supporting facts to enable explainable answer prediction 25 Preprint. Under review

work page 2018
[20]

Bamboogle (Press et al., 2023): a small, manually constructed QA dataset of 125 challeng- ing real-world multi-hop questions designed such that answers cannot be directly retrieved from search engines, requiring compositional reasoning across multiple pieces of evidence

work page 2023
[21]

MusiQue (Trivedi et al., 2022): a multi-hop QA dataset constructed by composing connected single-hop questions into 2–4 hop reasoning chains, explicitly designed to enforce genuine multi-step reasoning and reduce shortcut-based answering

work page 2022
[22]

HoVer (Jiang et al., 2020): a multi-hop fact verification dataset built from Wikipedia where models must retrieve evidence across 2–4 documents and determine whether a claim is supported or not, emphasizing complex many-hop reasoning and evidence extraction

work page 2020
[23]

Which university did the author ofThe Old Man and the Sea attend?

Ambig QA (Min et al., 2020): an open-domain QA dataset derived from NQ-open that focuses on ambiguous questions, requiring models to generate all plausible answers along with corresponding disambiguated question rewrites to explicitly resolve ambiguity. Table 2: Statistics of datasets. Dataset Train Val Test Total Reasoning 2WikiQA 154K 16K 22K 192K 2-hop...

work page 2020