From Intent to Execution: Composing Agentic Workflows with Agent Recommendation

Brian Riordan; Kishan Athrey; Mahesh Viswanathan; Ramin Pishehvar

arxiv: 2605.03986 · v1 · submitted 2026-05-05 · 💻 cs.AI

Pith reviewed 2026-05-07 16:13 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systems · agent recommendation · LLM planner · information retrieval · critique agent · workflow automation · task completion · AI agents

The pith

A framework with LLM planning and two-stage agent recommendation automates the composition of multi-agent AI workflows from user intents and raises recall over prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework that takes a user intent and automatically generates a plan with an LLM, describes tasks in natural language, builds a dynamic call graph, and maps agents to those tasks. The core module is an agent recommender that first uses a fast retriever to pull candidates from local and global registries, then applies an LLM re-ranker for better matches, with an optional critique agent that reviews the selections against the full plan. Experiments test variations in embedders, re-rankers, and the critique step, then measure end-to-end success in planning, selection, and task completion. The results show higher recall rates and improved robustness and scalability compared with earlier approaches.

Core claim

The authors establish that integrating an LLM-derived planner, natural-language task descriptions, a dynamic call graph, an orchestrator, and an agent recommender built on two-stage information retrieval plus a supervising critique agent produces multi-agent systems from user intents with higher recall in agent-task matching and overall task completion than previous manual or less automated methods.

What carries the argument

The agent recommender, which uses a fast retriever for initial candidates from agent registries followed by an LLM-based re-ranker, optionally supervised by a critique agent that reevaluates selections against the overall plan.
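The retrieve-then-re-rank pattern described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for a real embedder, `keyword_judge` is a hypothetical placeholder for the LLM-based re-ranker, and the `REGISTRY` contents are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedder."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(task, registry, k):
    """Stage 1: cheap similarity search over every agent description."""
    query = embed(task)
    ranked = sorted(registry, key=lambda name: cosine(query, embed(registry[name])), reverse=True)
    return ranked[:k]


def rerank(task, candidates, registry, score_fn, top_n):
    """Stage 2: rescore only the shortlisted candidates with a costlier scorer."""
    ranked = sorted(candidates, key=lambda name: score_fn(task, registry[name]), reverse=True)
    return ranked[:top_n]


def keyword_judge(task, description):
    """Hypothetical placeholder for an LLM re-ranker: share of task words covered."""
    task_words = set(re.findall(r"[a-z]+", task.lower()))
    desc_words = set(re.findall(r"[a-z]+", description.lower()))
    return len(task_words & desc_words) / len(task_words) if task_words else 0.0


# Invented registry entries, for illustration only.
REGISTRY = {
    "flight_search": "search flights between two cities on a given date",
    "hotel_booking": "book hotel rooms in a city for given dates",
    "weather_report": "report the weather forecast for a city",
    "currency_convert": "convert an amount between two currencies",
}

task = "find flights between two cities"
candidates = retrieve(task, REGISTRY, k=3)
selected = rerank(task, candidates, REGISTRY, keyword_judge, top_n=1)
print(candidates, selected)
```

The design point is cost: the first stage touches every registry entry and so must be cheap, while the second stage is only ever applied to the `k` shortlisted candidates.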

If this is right

The approach yields higher recall rates than state-of-the-art methods for selecting agents that fulfill planned tasks.
The system is more robust and scalable than prior manual composition methods.
Adding the critique agent further raises the recall score by reviewing selections against the full plan.
Multiple manual steps in building multi-agent systems are replaced by automated planning and agent mapping.
Agents from both local and global registries can be dynamically matched to tasks in an execution graph.
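Since the headline metric is recall, it helps to pin down one common definition. The sketch below computes recall@k per task and averages it over tasks; whether the paper uses exactly this variant is not stated on this page, and the agent identifiers and rankings are made up for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant agents that appear in the top-k of a ranking."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (ranking, relevant set) pairs, one pair per task."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Illustrative data only: two tasks, each with a ranking and its gold agents.
runs = [
    (["a1", "a7", "a3"], {"a1", "a3"}),
    (["a4", "a2", "a9"], {"a9", "a5"}),
]

print(mean_recall_at_k(runs, k=2))
print(mean_recall_at_k(runs, k=3))
```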

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Non-expert users could prototype complex agent-based applications more quickly without hand-picking every component.
The same recommendation pattern might extend to selecting tools or models in other automated pipelines.
Real-time monitoring could trigger re-recommendation of agents if a task fails during execution.
Performance on larger, domain-specific agent pools would provide a direct test of the claimed scalability.

Load-bearing premise

The fast retriever, LLM re-ranker, and critique agent will reliably pick agents that can actually complete the planned tasks when the system runs on new intents outside the tested cases.

What would settle it

Apply the full framework to a new collection of user intents never seen in the experiments and measure the fraction of tasks that the selected agents successfully complete.
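A sketch of that check, under stated assumptions: `held_out` filters intents by normalized exact match (a weak proxy for "never seen"; a real study would likely need semantic deduplication), and the `outcomes` values below are invented placeholders, not results from the paper.

```python
def held_out(intents, seen):
    """Keep only intents whose normalized text does not appear in the seen set."""
    seen_norm = {s.strip().lower() for s in seen}
    return [i for i in intents if i.strip().lower() not in seen_norm]


def completion_rate(outcomes):
    """outcomes maps each intent to a list of per-task success flags."""
    flags = [ok for per_task in outcomes.values() for ok in per_task]
    return sum(flags) / len(flags) if flags else 0.0


seen = ["Plan a trip to Paris"]
intents = ["plan a trip to paris ", "Book a dentist appointment", "Summarize my inbox"]
new_intents = held_out(intents, seen)

# Placeholder outcomes; in practice these come from running the full pipeline.
outcomes = {
    "Book a dentist appointment": [True, True, False],
    "Summarize my inbox": [True],
}
rate = completion_rate(outcomes)
print(new_intents, rate)
```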

Figures

Figures reproduced from arXiv: 2605.03986 by Brian Riordan, Kishan Athrey, Mahesh Viswanathan, Ramin Pishehvar.

**Figure 1.** Architecture for an end-to-end MAS with dynamic and redundant workflow

**Figure 2.** Agent Recommender that includes the enrich, retrieve, and re-rank phases.

**Figure 3.** Agent description enrichment by appending synthetic queries generated from the agent de…

**Figure 4.** ReAct Style Architecture [17] For Planning/Composition. The Planner generates sub tasks and…

**Figure 5.** ReAct Style Planning architecture with integrated critique loop. The agent node generates…

Original abstract

Multi-Agent Systems (MAS) built using AI agents fulfill a variety of user intents that may be used to design and build a family of related applications. However, the creation of such MAS currently involves manual composition of the plan, manual selection of appropriate agents, and manual creation of execution graphs. This paper introduces a framework for the automated creation of multi-agent systems which replaces multiple manual steps with an automated framework. The proposed framework consists of software modules and a workflow to orchestrate the requisite task-specific application. The modules include: an LLM-derived planner, a set of tasks described in natural language, a dynamic call graph, an orchestrator for map agents to tasks, and an agent recommender that finds the most suitable agent(s) from local and global agent registries. The agent recommender uses a two-stage information retrieval (IR) system comprising a fast retriever and an LLM-based re-ranker. We implemented a series of experiments exploring the choice of embedders, re-rankers, agent description enrichment, and supervising critique agent. We benchmarked this system end-to-end, evaluating the combination of planning, agent selection, and task completion, with our proposed approach. Our experimental results show that our approach outperforms the state-of-the-art in terms of the recall rate and is more robust and scalable compared to previous approaches. The critique agent holistically reevaluates both agent and tool recommendations against the overall plan. We show that the inclusion of the critique agent further enhances the recall score, proving that the comprehensive review and revision of task-based agent selection is an essential step in building end-to-end multi-agent systems.
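The abstract mentions agent description enrichment, and the Figure 3 caption says it works by appending synthetic queries to the agent description. A minimal sketch of that idea follows. How the queries are generated is not specified on this page, so they are hand-written here; the function name and the "Example queries" layout are assumptions for illustration.

```python
def enrich_description(description, synthetic_queries):
    """Append de-duplicated synthetic queries to an agent description before indexing."""
    cleaned = (q.strip() for q in synthetic_queries)
    unique = list(dict.fromkeys(q for q in cleaned if q))  # keeps first-seen order
    if not unique:
        return description
    lines = "\n".join(f"- {q}" for q in unique)
    return f"{description}\nExample queries:\n{lines}"


# Hand-written queries stand in for generated ones.
enriched = enrich_description(
    "Searches flights between two cities.",
    ["find me a flight to Rome", "find me a flight to Rome", " cheapest flight next Friday "],
)
print(enriched)
```

The intuition is that user-style queries sit closer to incoming task descriptions than a terse capability summary does, so the enriched text gives the retriever more to match on.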

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper offers a concrete automation pipeline for multi-agent workflows but its claims of superior recall and robustness rest on experiments whose strength is unclear without the numbers.

This paper describes a framework that automates the composition of multi-agent systems from user intent, using an LLM planner, dynamic call graph, two-stage agent recommender, and a critique agent. The main practical advance is replacing manual agent selection and graph building with this pipeline, and they test variations like different embedders and the effect of the critique step. It does well in outlining a complete workflow and showing that the critique agent improves recall by reviewing recommendations against the overall plan. The integration of the dynamic call graph for execution is a solid engineering choice.

The soft spots are in the evaluation. The abstract claims better recall than state-of-the-art and greater robustness and scalability, but without seeing the specific numbers, baselines, or how they measured task completion end-to-end, it's difficult to tell how much of an improvement it really is. The concern about whether the recall on agent selection holds up for actual task execution with external or noisy agent descriptions seems valid based on the description; the experiments may be too limited to support the broader claims.

This is the kind of paper that would interest people working on building reliable agentic applications, as it offers a blueprint for reducing manual effort. A reader in the agentic workflows subfield could pick up useful implementation ideas even if the results need more validation. It deserves a serious referee because the core idea is sound and addresses a real pain point, though the authors should be prepared for questions on the experimental design and generalization. I recommend putting it through peer review rather than desk rejecting it.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces a framework for automated composition of multi-agent systems (MAS) from user intents. It replaces manual planning, agent selection, and graph construction with an LLM-derived planner that generates natural-language tasks, a two-stage information retrieval recommender (fast retriever plus LLM re-ranker) that selects agents from local and global registries, an orchestrator that builds dynamic call graphs, and a critique agent that holistically reviews recommendations against the overall plan. End-to-end experiments on embedder/re-ranker choices, description enrichment, and the critique agent are reported to show higher recall than prior approaches, with the critique step providing further gains, plus claims of improved robustness and scalability.

Significance. If the empirical claims are substantiated with detailed, reproducible results, the integration of a critique agent for plan-level review and the two-stage IR recommender could meaningfully reduce manual effort in MAS construction. The end-to-end evaluation of planning plus selection plus task completion is a positive design choice. However, the absence of any numerical recall values, baseline methods, dataset descriptions, error bars, or statistical tests makes it impossible to assess whether the reported outperformance and robustness claims actually hold, limiting the work's current significance.

major comments (3)

[Abstract and experimental results] Abstract and experimental results section: the central claim that the approach 'outperforms the state-of-the-art in terms of the recall rate' and is 'more robust and scalable' is asserted without any quantitative metrics, baseline names, dataset sizes, number of trials, or statistical tests. This directly undermines evaluation of the headline empirical result.
[Experimental evaluation] Experimental evaluation (end-to-end benchmarking): recall is measured on agent selection from (apparently fixed or internal) registries, yet the robustness and scalability assertions require evidence that selected agents actually complete the planned tasks when agent descriptions are noisy, incomplete, or drawn from external public pools. No such out-of-distribution execution-success results are provided, so the measured recall does not support the broader claims.
[Agent recommender and critique agent] Agent recommender description: the two-stage IR system (fast retriever + LLM re-ranker) plus critique agent is presented as the key technical contribution, but no ablation isolating the contribution of each stage, no details on how the re-ranker prompt is constructed, and no analysis of failure modes when the fast retriever returns poor candidates are given. These omissions are load-bearing for the scalability argument.
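The retriever failure mode raised in the last comment can be stated as a simple bound. The notation below is introduced here for illustration and does not come from the paper; the inequality holds only under the assumption that the re-ranker and critique agent choose among the retrieved candidates and never add agents from outside that set.

```latex
% T: set of planned tasks; G(t): agents that can fulfil task t (ground truth)
% C_k(t): top-k candidates returned by the fast retriever for task t
% S(t): final selection after re-ranking, assumed to satisfy S(t) \subseteq C_k(t)
\[
\mathrm{Recall}_{\mathrm{final}}
  = \frac{1}{|T|}\sum_{t \in T}\frac{|S(t)\cap G(t)|}{|G(t)|}
  \;\le\;
  \frac{1}{|T|}\sum_{t \in T}\frac{|C_k(t)\cap G(t)|}{|G(t)|}
  = \mathrm{Recall}_{\mathrm{retriever}}@k
\]
```

Under that assumption, a correct agent missed in the first stage cannot be recovered later, which is why reporting retriever recall@k alongside final recall would make the requested ablation informative.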

minor comments (2)

[Abstract] The abstract states that 'a series of experiments exploring the choice of embedders, re-rankers, agent description enrichment, and supervising critique agent' were run, yet no tables, figures, or quantitative outcomes from these ablations are referenced or summarized.
[Framework overview] Notation for the orchestrator and dynamic call graph is introduced but never formalized; a short pseudocode or diagram would clarify how tasks are mapped to agents at runtime.
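To make the request in the last comment concrete, here is one way such pseudocode might look. It is an editorial sketch under stated assumptions, not the authors' orchestrator: the call graph is a fixed dependency map ordered with the standard-library `graphlib.TopologicalSorter`, agents are plain Python callables, and all task and agent names are invented. Dynamic re-planning during execution is not modeled.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """dependencies maps each task to the set of tasks it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def run_plan(dependencies, assignment, agents):
    """Run each task with its assigned agent once all upstream results exist."""
    results = {}
    for task in execution_order(dependencies):
        upstream = {dep: results[dep] for dep in dependencies.get(task, ())}
        agent = agents[assignment[task]]
        results[task] = agent(task, upstream)
    return results


# Hypothetical plan: task -> prerequisite tasks.
DEPENDENCIES = {
    "search_flights": set(),
    "book_hotel": {"search_flights"},
    "summarize": {"search_flights", "book_hotel"},
}

# Hypothetical output of the agent recommender: task -> agent name.
ASSIGNMENT = {
    "search_flights": "flight_agent",
    "book_hotel": "hotel_agent",
    "summarize": "summary_agent",
}

# Stub agents; real agents would call external services.
AGENTS = {
    "flight_agent": lambda task, upstream: "flight LH123",
    "hotel_agent": lambda task, upstream: f"hotel near arrival of {upstream['search_flights']}",
    "summary_agent": lambda task, upstream: " | ".join(upstream[k] for k in sorted(upstream)),
}

order = execution_order(DEPENDENCIES)
results = run_plan(DEPENDENCIES, ASSIGNMENT, AGENTS)
print(order)
print(results["summarize"])
```

`TopologicalSorter` raises `CycleError` on cyclic plans, which gives a cheap validity check on planner output before any agent is invoked.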

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for improving the clarity and substantiation of our experimental claims and technical contributions. We agree that additional quantitative details, ablations, and supporting evaluations are needed to fully support the claims of outperformance, robustness, and scalability. We will revise the manuscript to address these points comprehensively. Our point-by-point responses to the major comments are provided below.

Referee: [Abstract and experimental results] Abstract and experimental results section: the central claim that the approach 'outperforms the state-of-the-art in terms of the recall rate' and is 'more robust and scalable' is asserted without any quantitative metrics, baseline names, dataset sizes, number of trials, or statistical tests. This directly undermines evaluation of the headline empirical result.

Authors: We acknowledge that while the manuscript describes a series of experiments on embedder and re-ranker choices, description enrichment, and the critique agent, and states that the approach outperforms prior methods with higher recall, the specific numerical recall values, baseline method names, dataset sizes, number of trials, error bars, and statistical tests are not reported in the current text. In the revised manuscript, we will expand the experimental results section to include these details, such as tables showing recall rates (with and without the critique agent), identification of the baselines used, descriptions of the datasets and their sizes, trial counts, and appropriate statistical validation to substantiate the outperformance and robustness claims. revision: yes
Referee: [Experimental evaluation] Experimental evaluation (end-to-end benchmarking): recall is measured on agent selection from (apparently fixed or internal) registries, yet the robustness and scalability assertions require evidence that selected agents actually complete the planned tasks when agent descriptions are noisy, incomplete, or drawn from external public pools. No such out-of-distribution execution-success results are provided, so the measured recall does not support the broader claims.

Authors: The manuscript does report end-to-end benchmarking that combines planning, agent selection, and task completion, with the critique agent providing holistic review. However, we agree that the evaluation of recall is based on the registries used in our controlled experiments and does not include explicit results on task completion success with noisy/incomplete descriptions or external public agent pools. In the revision, we will add experiments evaluating execution success rates under such out-of-distribution conditions to better support the robustness and scalability assertions. revision: yes
Referee: [Agent recommender and critique agent] Agent recommender description: the two-stage IR system (fast retriever + LLM re-ranker) plus critique agent is presented as the key technical contribution, but no ablation isolating the contribution of each stage, no details on how the re-ranker prompt is constructed, and no analysis of failure modes when the fast retriever returns poor candidates are given. These omissions are load-bearing for the scalability argument.

Authors: We agree that the current description of the two-stage recommender and critique agent would benefit from greater detail to support the scalability claims. In the revised manuscript, we will add ablation studies isolating the contributions of the fast retriever, LLM re-ranker, and critique agent to overall performance. We will also include the specific prompt construction for the re-ranker and an analysis of failure modes (e.g., poor candidate retrieval by the fast retriever) along with how the re-ranker and critique agent mitigate them. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework without derivations or self-referential predictions

The paper introduces a software framework for automated MAS composition consisting of an LLM planner, task descriptions, dynamic call graph, orchestrator, and a two-stage IR agent recommender (fast retriever + LLM re-ranker) plus optional critique agent. All central claims concern experimental outcomes: higher recall rates than SOTA, improved robustness/scalability, and further gains from the critique agent when evaluating planning + selection + task completion end-to-end. No equations, first-principles derivations, fitted parameters, or uniqueness theorems appear in the provided text or abstract. Claims rest on described benchmarks rather than any step that reduces by construction to its own inputs, self-citations, or ansatzes. The evaluation is therefore self-contained as standard empirical reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Framework rests on standard assumptions about LLM planning and retrieval quality rather than new axioms or fitted constants.

axioms (1)

domain assumption: LLMs can produce usable task plans and effective critique of agent selections from natural language descriptions
Invoked throughout the planner, recommender, and critique modules.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 1 internal anchor

[1] W. Chen, W. Li, D. Yao, X. Meng, C. Gong, and J. Bi. GTool: Graph enhanced tool planning with large language model. Proceedings of the International Conference on Machine Learning (ICML), 2026.
[2] Y. Chen et al. Re-invoke: Tool invocation rewriting for zero-shot tool retrieval. arXiv preprint arXiv:2408.01875, 2024.
[3] Hsin-Ling Hsu. DAT: Dynamic alpha tuning for hybrid retrieval in retrieval-augmented generation. arXiv preprint arXiv:2503.23013, 2025.
[4] Y. Huang, J. Shi, Y. Li, C. Fan, S. Wu, Q. Zhang, Y. Liu, P. Zhou, Y. Wan, N. Z. Gong, and L. Sun. MetaTool benchmark for large language models: Deciding whether to use tools and which to use. arXiv preprint arXiv:2310.03128, 2023.
[5] J. Jia and Q. Li. AutoTool: Efficient tool selection for large language model agents. Proceedings of the AAAI Conference on Artificial Intelligence, 2025.
[6] S. H. Lim, N. Schick, S. Ebrahimi, A. Veit, and C. Xing. ITR-RAG: Iterative tuning for retrieval-augmented generation. arXiv preprint arXiv:2406.17465, 2024.
[7] E. Lumer, P. H. Basavaraju, M. Mason, J. A. Burke, and V. K. Subbiah. Graph RAG-tool fusion. arXiv preprint arXiv:2502.07223, 2025.
[8] OpenAI. text-embedding-3-large. https://platform.openai.com/docs/models/text-embedding-3-large
[9] S. Shen, K. Song, X. Tan, W. Zhang, K. Ren, S. Yuan, W. Lu, D. Li, and Y. Zhuang. TaskBench: Benchmarking large language models for task automation. arXiv preprint arXiv:2311.18760, 2024.
[10] Z. Shi, Y. Wang, L. Yan, P. Ren, S. Wang, D. Yin, and Z. Ren. Retrieval models aren’t tool-savvy: Benchmarking tool retrieval for large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2025.
[11] Z. Shi, L. Yan, W. Sun, Y. Feng, P. Ren, X. Ma, S. Wang, D. Yin, M. de Rijke, and Z. Ren. Direct retrieval-augmented optimization: Synergizing knowledge selection and language models. arXiv preprint arXiv:2505.03075, 2025.
[12] R. Wang, X. Han, L. Ji, S. Wang, T. Baldwin, and H. Li. ToolGen: Unified tool retrieval and calling via generation. In Proceedings of the International Conference on Learning Representations (ICLR), 2025.
[13] S. Wang, B. Liu, S. Li, and W. Hou. ReAGT: Retrieval-augmented generation with trustworthiness for knowledge-intensive tasks. In Proceedings of the KnowLLM Workshop, 2025.
[14] S. Wang, Z. Tan, Z. Chen, S. Zhou, T. Chen, and J. Li. AnyMAC: Cascading flexible multi-agent collaboration via next-agent prediction. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.
[15] Weaviate. Weaviate vector database. https://weaviate.io
[16] X. Wei, Y. Dong, X. Wang, X. Zhang, Z. Zhao, D. Shen, L. Xia, and D. Yin. Beyond ReAct: A planner-centric framework for complex tool-augmented LLM reasoning. arXiv preprint arXiv:2511.10037, 2025.
[17] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
[18] Y. Zhang, X. Liu, and C. Xiao. MetaAgent: Automatically constructing multi-agent systems based on finite state machines. Proceedings of the International Conference on Machine Learning (ICML), 2025.
