Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition

Fanchen Yu; Lei Bai; Peng Ye; Shengji Tang; Tao Chen; Ting Liu; Wanli Ouyang; Wenzhen Yuan; Wutao Xiong; Yuzhuo Fu

arxiv: 2604.17009 · v1 · submitted 2026-04-18 · 💻 cs.AI


Pith reviewed 2026-05-10 06:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-agent systems · orchestration · parallel decomposition · unified action space · lightweight orchestrator · reinforcement learning · protocol normalization · task delegation


The pith

Treating agents and tools uniformly as actions enables a small model to orchestrate complex tasks through parallel decomposition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to solve the problems of high complexity and poor extensibility in multi-agent systems caused by heterogeneous interfaces and serial scheduling. It introduces the Agent-as-Tool paradigm to create a standardized action space for both agents and tools, complete with protocol normalization and state feedback. A lightweight orchestrator called ParaManager is then trained to handle planning separately from execution, supporting parallel subtask breakdown, delegation, and asynchronous running. The training uses supervised fine-tuning with built-in recovery and reinforcement learning to optimize several performance aspects. Readers should care if this makes building and scaling such systems simpler and more reliable.

Core claim

The paper claims that abstracting agents and tools into a unified learnable action space with normalization and feedback allows a small model to act as an effective master orchestrator. ParaManager separates the planning of subtasks from their solving, enabling state-aware parallel decomposition and asynchronous delegation. It is trained in two stages: first with SFT trajectories that include recovery mechanisms, then with RL to balance success rate, protocol adherence, output diversity, and efficiency. This leads to competitive results on benchmarks and the ability to generalize to previously unseen sets of models.
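To make the four RL objectives concrete, here is a minimal sketch of a scalar reward that combines them. The weighted-sum form, the weights, the `Rollout` fields, and the turn budget are all illustrative assumptions; the summary above says only that RL balances success, protocol adherence, diversity, and efficiency, not how the terms are combined.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical summary of one orchestration rollout (not the paper's schema)."""
    correct: bool          # final answer matched the reference
    format_ok: bool        # every call followed the normalized protocol
    tools_used: list[str]  # names of the agents/tools invoked, in order
    turns: int             # number of orchestrator turns


def reward(r: Rollout, pool_size: int, max_turns: int = 8,
           w_acc: float = 1.0, w_fmt: float = 0.2,
           w_div: float = 0.1, w_eff: float = 0.1) -> float:
    """Weighted sum of four terms; the weights are assumed, not from the paper."""
    accuracy = 1.0 if r.correct else 0.0
    fmt = 1.0 if r.format_ok else 0.0
    # Diversity: fraction of the available pool that was actually used.
    diversity = len(set(r.tools_used)) / pool_size if pool_size else 0.0
    # Efficiency: fewer turns is better, clipped at zero.
    efficiency = max(0.0, 1.0 - r.turns / max_turns)
    return w_acc * accuracy + w_fmt * fmt + w_div * diversity + w_eff * efficiency
```

A weighted sum is only one plausible choice; gating the auxiliary terms on correctness would be another, and the summary does not say which the authors use.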

What carries the argument

The Agent-as-Tool paradigm that abstracts agents and tools into a standardized learnable action space with protocol normalization and state feedback, enabling ParaManager to perform parallel subtask decomposition and delegation.
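A minimal sketch of what a unified action space with state feedback might look like, assuming agents and tools are both plain callables taking a parameter dict. The `ActionSpace` class, its method names, and the mapping from failures to statuses are hypothetical; only the four status values and the (value, status) shape of an observation come from the equation quoted alongside Figure 1, with underscores added here to form valid Python identifiers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable


class Status(Enum):
    OK = "OK"
    PARSE_ERR = "PARSE_ERR"
    EXEC_ERR = "EXEC_ERR"
    TIMEOUT = "TIMEOUT"


@dataclass
class Observation:
    value: Any      # v: payload returned by the call (None if nothing ran)
    status: Status  # sigma: execution status fed back to the orchestrator


class ActionSpace:
    """Hypothetical registry exposing agents and tools through one signature."""

    def __init__(self) -> None:
        self._actions: dict[str, Callable[[dict], Any]] = {}

    def register(self, name: str, fn: Callable[[dict], Any]) -> None:
        self._actions[name] = fn

    def call(self, name: str, params: Any) -> Observation:
        # Protocol normalization: reject malformed requests before execution.
        if name not in self._actions or not isinstance(params, dict):
            return Observation(None, Status.PARSE_ERR)
        try:
            return Observation(self._actions[name](params), Status.OK)
        except TimeoutError:
            return Observation(None, Status.TIMEOUT)
        except Exception as exc:  # any other failure becomes state feedback
            return Observation(str(exc), Status.EXEC_ERR)
```

Because every call returns the same `Observation` shape, the orchestrator never needs to know whether a given name is backed by a tool or by another model.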

If this is right

Parallel subtask decomposition and asynchronous execution become possible, avoiding serial bottlenecks.
Standardization of interfaces reduces system complexity and allows easier addition of new components.
The orchestrator generalizes robustly to unseen model pools.
The two-stage SFT and RL training balances task success with protocol compliance, diversity, and efficiency.
Recovery mechanisms during training enhance robustness to errors.
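The first bullet above, parallel decomposition with asynchronous execution, can be sketched with Python's `asyncio`. This is an illustration under stated assumptions rather than the paper's implementation: `run_subtask` is a hypothetical stand-in for an agent or tool call, and the plan format (worker, subtask, simulated latency) is invented for the example.

```python
import asyncio


async def run_subtask(worker: str, subtask: str, delay: float) -> dict:
    """Stand-in for an agent or tool call; `delay` simulates its latency."""
    await asyncio.sleep(delay)
    return {"worker": worker, "subtask": subtask,
            "result": f"{worker} done: {subtask}"}


async def orchestrate_round(plan: list[tuple[str, str, float]]) -> list:
    """Launch every subtask in one round concurrently and collect the results.

    return_exceptions=True returns failures as values instead of raising the
    first one, so the orchestrator can treat them as feedback for replanning.
    """
    calls = [run_subtask(worker, subtask, delay) for worker, subtask, delay in plan]
    return await asyncio.gather(*calls, return_exceptions=True)


def run(plan: list[tuple[str, str, float]]) -> list:
    """Synchronous entry point for a single orchestration round."""
    return asyncio.run(orchestrate_round(plan))
```

`asyncio.gather` returns results in the order the calls were submitted, regardless of which finished first, which keeps the consolidation step simple. A serial scheduler would instead pay the sum of the latencies.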

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Smaller models might handle orchestration duties more efficiently than larger ones in many scenarios.
The unified space could be adopted in other multi-component AI setups for better interoperability.
State feedback might support more adaptive replanning in dynamic environments.

Load-bearing premise

That standardizing agents and tools into a learnable action space with protocol normalization and state feedback sufficiently reduces system complexity and improves extensibility without new interface or coordination failures.

What would settle it

A standard-benchmark run in which ParaManager fails to orchestrate tasks once new agents or tools with non-standard protocols are added, without retraining or adjustment, would undercut the extensibility claim.

Figures

Figures reproduced from arXiv: 2604.17009 by Fanchen Yu, Lei Bai, Peng Ye, Shengji Tang, Tao Chen, Ting Liu, Wanli Ouyang, Wenzhen Yuan, Wutao Xiong, Yuzhuo Fu.

**Figure 1.** Multi-round, state-driven parallel orchestration: ParaManager iteratively decomposes the query into subtasks, launches parallel agent/tool calls via a unified tool pool in each round, and cross-checks and consolidates intermediate results to produce the final answer.

Adjacent text from the paper (truncated in extraction): a single tool call returns a per-call observation o ≜ T_k(p) = (v, σ), with status σ ∈ {OK, PARSE ERR, EXEC ERR, TIMEOUT} (Eq. 1).

**Figure 2.** Two-stage training pipeline for ParaManager. SFT: sample multiple trajectories per instance, filter by correct final answer, error correction, and tool-balance, and retain one high-quality trajectory as supervision. RL: sample multiple rollouts, keep instances with mixed outcomes to ensure informative signals, and optimize ParaManager with rewards on accuracy, format, diversity, and efficiency.
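The two data-selection steps named in the Figure 2 caption can be sketched as follows. The `Trajectory` fields and the ranking rule (prefer trajectories that contain a corrected error, then those using more distinct tools) are assumptions made for illustration; the caption names the filtering criteria but not how they are applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trajectory:
    """Hypothetical record of one sampled trajectory (fields are assumptions)."""
    correct: bool     # final answer is correct
    recovered: bool   # contains an error that was later corrected
    tools: list[str]  # agents/tools called along the way


def pick_sft_trajectory(samples: list[Trajectory]) -> Optional[Trajectory]:
    """Keep correct trajectories, then retain one by an assumed ranking."""
    kept = [t for t in samples if t.correct]
    if not kept:
        return None
    # Assumed preference: error recovery first, then breadth of tool use.
    return max(kept, key=lambda t: (t.recovered, len(set(t.tools))))


def keep_for_rl(outcomes: list[bool]) -> bool:
    """Keep an instance only if its rollouts include both successes and failures."""
    return any(outcomes) and not all(outcomes)
```

The mixed-outcome filter matters because an instance where every rollout succeeds or every rollout fails gives no contrast between rollouts, so it contributes little learning signal.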

**Figure 3.** The curve of changes in the average number of model turns in the RL stage with/without SFT initialization.

Adjacent text from the paper (truncated in extraction): To ensure a fair comparison across methods, all baselines share the same tool suite and model pool, and we set the maximum generation length to 24,576 tokens and temperature of 1.0 for every model invocation. For each benchmark instance, we run 8 independent samples and report the mean accuracy (mean@…
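As a small worked example of the reporting protocol described above (several independent samples per instance, averaged), the following sketch computes the mean accuracy over a benchmark. The input layout, one list of pass/fail flags per instance, is an assumption made for the example.

```python
from statistics import mean


def mean_accuracy(per_instance_samples: list[list[bool]]) -> float:
    """Average accuracy within each instance's samples, then across instances."""
    return mean(
        mean(1.0 if ok else 0.0 for ok in samples)
        for samples in per_instance_samples
    )
```

With 8 samples per instance, an instance solved in 4 of 8 runs contributes 0.5, so a benchmark of one always-solved and one half-solved instance scores 0.75.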

**Figure 5.** Model and tool distribution of the untrained base model Qwen3-4B-Instruct-2507 and the trained ParaManager. The top row shows tool distribution, and the bottom row shows model distribution.

**Figure 6.** Analysis of training dynamics. (a) shows the average accuracy improvement, while (b) illustrates the increase in tool usage diversity throughout the training process.

**Figure 7.** Case Study 1: Successful Multi-Agent Coordination.

**Figure 8.** Case Study 2: Conflict Resolution and Self-Correction.

**Figure 9.** System prompt for the Critical Reviewer agent.

**Figure 10.** The full system prompt for the Central Orchestrator agent.

**Figure 11.** System prompt for the Code Reasoner agent without direct tool execution.

Adjacent text from the paper (Knowledge Searcher system prompt, truncated in extraction):

System Prompt: Knowledge Searcher
# Role
You are the **Knowledge Searcher** agent of a multi-agent problem-solving system. Your primary function is to search the information to solve the <user query>.
# Input Data Structure
The input you receive will contain the following two XML blocks:
* <user query>: The specific problem or instru…

**Figure 12.** System prompt for the Knowledge Searcher agent focusing on query generation.

**Figure 13.** System prompt for the Standard Reasoner agent.

**Figure 14.** System prompt for the Final Answer Generator.

Original abstract

Multi-agent systems (MAS) demonstrate clear advantages in tackling complex problems by coordinating diverse agents and external tools. However, most existing orchestration methods rely on static workflows or serial agent scheduling, and are further constrained by heterogeneous interface protocols between tools and agents. This leads to high system complexity and poor extensibility. To mitigate these issues, we propose Agent-as-Tool, a unified parallel orchestration paradigm that abstracts both agents and tools into a standardized, learnable action space with protocol normalization and explicit state feedback. Building on this paradigm, we train a lightweight orchestrator, ParaManager, which decouples planning decisions from subtask solving, enabling state-aware parallel subtask decomposition, delegation, and asynchronous execution. For training, we adopt a two-stage ParaManager training pipeline. It improves robustness by incorporating supervised fine-tuning (SFT) trajectories equipped with recovery mechanisms, and further applies reinforcement learning (RL) to achieve an optimal balance among task success, protocol compliance, diversity, and reasoning efficiency. Experiments show that ParaManager achieves strong performance across multiple benchmarks and exhibits robust generalization under unseen model pools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main move is standardizing agents and tools into one learnable action space so a small model can handle parallel decomposition and delegation without static scripts or serial bottlenecks.


The core idea is to treat agents and tools as interchangeable actions with normalized protocols and state feedback, then train a lightweight ParaManager to break tasks into parallel subtasks. Decoupling planning from execution is the practical part: it could make multi-agent setups easier to extend than fixed workflows or one-at-a-time scheduling. The two-stage pipeline (SFT with recovery trajectories, followed by RL to tune success, compliance, diversity, and efficiency) gives a clear way to train the orchestrator without over-relying on any single objective. Together these pieces address real friction in current MAS work around heterogeneous interfaces and poor scalability. The proposal stays internally consistent and avoids obvious circularity in how it frames the training or the action space.

What is less clear from the description is how much the standardization actually cuts down on new coordination failures once you move beyond the training distribution. The claim of strong benchmark results and generalization to unseen model pools is stated directly, but without specific numbers, baselines, or ablation details it is hard to judge whether the gains are large enough to matter in practice or mostly incremental.

This is the kind of paper that would interest people already building or evaluating agent systems who need a lighter orchestrator than a full LLM planner. It is coherent enough on its own terms to warrant referee time rather than a desk reject, even if the experiments will need close review for reproducibility and effect size.

Referee Report

1 major / 3 minor

Summary. The manuscript proposes the Agent-as-Tool paradigm, which unifies agents and tools in multi-agent systems by abstracting them into a standardized, learnable action space equipped with protocol normalization and explicit state feedback. It introduces ParaManager, a lightweight orchestrator trained via a two-stage pipeline (SFT with recovery trajectories followed by RL) to perform state-aware parallel subtask decomposition, delegation, and asynchronous execution. The central claim is that this decouples planning from subtask solving, reduces system complexity, and yields strong benchmark performance with robust generalization to unseen model pools.

Significance. If the performance and generalization results hold, the work offers a practical route to more extensible and efficient MAS orchestration by replacing static/serial workflows with a unified, parallelizable action space and a small-model planner. The two-stage training that jointly optimizes success, compliance, diversity, and efficiency is a concrete contribution that could influence future agent-tool systems.

major comments (1)

[§4 (Experiments) and associated tables] The abstract asserts 'strong performance across multiple benchmarks' and 'robust generalization under unseen model pools,' yet the provided text supplies no numerical results, baseline comparisons, success rates, or error bars. Without these data the central empirical claim cannot be evaluated; please add the specific benchmarks, quantitative tables, and statistical details.

minor comments (3)

[Abstract] Name the concrete benchmarks (e.g., those appearing in §4) rather than referring generically to 'multiple benchmarks' so readers can immediately assess scope.
[Training pipeline] Clarify the precise form of the RL reward (or multi-objective combination) that balances the four stated factors; if an equation exists, reference it explicitly.
[§3 (Paradigm), notation] Ensure consistent use of 'protocol normalization' and 'state feedback' across sections; a short glossary or diagram would aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the minor revision recommendation. The single major comment is addressed below.


Referee: [§4 (Experiments) and associated tables] The abstract asserts 'strong performance across multiple benchmarks' and 'robust generalization under unseen model pools,' yet the provided text supplies no numerical results, baseline comparisons, success rates, or error bars. Without these data the central empirical claim cannot be evaluated; please add the specific benchmarks, quantitative tables, and statistical details.

Authors: We agree that the experimental claims require explicit quantitative support for evaluation. The current manuscript text does not include the requested numerical results, tables, baselines, success rates, or error bars. In the revised version we will expand §4 with the specific benchmarks used, full quantitative tables (including success rates, baseline comparisons, and statistical details such as error bars or variance where applicable), and explicit reporting of generalization results on unseen model pools.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The paper introduces the Agent-as-Tool paradigm and ParaManager orchestrator, trained via a standard two-stage SFT+RL pipeline on trajectories with recovery mechanisms. No equations, derivations, or self-referential definitions appear that reduce performance claims or generalization results to inputs by construction. Training methods reference established SFT and RL techniques without fitted parameters or ansatzes defined circularly within the work. Empirical benchmark results and claims of robustness to unseen model pools rest on external evaluation rather than internal reduction to self-citations or renamed patterns. The derivation chain is self-contained with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, mathematical axioms, or new physical entities are described. The central contribution is the introduced Agent-as-Tool abstraction and associated training procedure.

pith-pipeline@v0.9.0 · 5521 in / 1099 out tokens · 37590 ms · 2026-05-10T06:46:59.319803+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
cs.CL 2026-05 unverdicted novelty 4.0

This survey organizes RL for LLM multi-agent systems into reward families, credit units, and five orchestration sub-decisions, notes the absence of explicit stopping-decision training in its paper pool, and releases a...

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Autoflow: Automated workflow generation for large language model agents

URL https://aclanthology.org/2025.emnlp-industry.144/. Li, J., Beeching, E., Tunstall, L., Lipkin, B., Soletskyi, R., Huang, S., Rasul, K., Yu, L., Jiang, A. Q., Shen, Z., et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024a. Li, Z., Xu, S., Mei, K., ...

work page arXiv 2025
[2]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

ISSN 0360-0300. doi: 10.1145/3774896. URL https://doi.org/10.1145/3774896. Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98. Shao, Z., Wang, P., Zhu, Q....

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3774896 2024
[3]

name": "code reasoner

URL https://aclanthology.org/2025.emnlp-main.93/. A. Detailed Training Configurations. Tables 4 and 5 summarize the key hyperparameters for the SFT and RL stages, respectively. Table 4. Key Hyperparameters for SFT Phase. Hyperparameter Value ...

work page 2025
