UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

Bin Zhang; Biqing Qi; Erhan Zhang; Haitao Li; Jiaxin Mao; Jinyuan Feng; Lingyong Yan; Qi Liu; Rui Li; Shijie Wang

UnityMAS-O turns manually orchestrated LLM multi-agent workflows into trainable multi-agent RL systems by treating the full workflow as the optimization unit.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-06-29 18:14 UTC pith:LXKEYMNJ

Load-bearing objection: UnityMAS-O is a practical verl extension adding multi-agent abstractions and a Ray runtime so workflows can be treated as RL units, but the abstract supplies zero numbers or baselines to judge the claimed gains.

arXiv 2605.26646 v1 · pith:LXKEYMNJ · submitted 2026-05-26 · cs.AI cs.CL cs.MA


Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao


classification: cs.AI cs.CL cs.MA

keywords: LLM multi-agent systems · multi-agent reinforcement learning · workflow optimization · role-based credit assignment · parameter sharing · distributed RL runtime · PPO for agents

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most LLM-based multi-agent systems rely on hand-crafted prompts and rules, with little unified reinforcement learning to improve the agents. UnityMAS-O supplies four first-class objects—logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings—to represent and optimize an entire workflow at once. These objects allow role-specific credit assignment and any combination of parameter sharing or separation across agents. A Ray-based star-topology runtime keeps workflow execution separate from distributed PPO-style model updates, so users can add new agents or rewards without altering the core training loop. Experiments on retrieval-augmented QA, iterative search, and code generation show measurable gains after optimization, especially for smaller models.

Core claim

By representing any LLM multi-agent workflow through logical agent roles, graph trajectories, user-defined rewards assigned at role/turn/trajectory levels, and flexible agent-model mappings, UnityMAS-O converts the workflow into a single optimization unit that supports full, partial, or no parameter sharing and delivers multi-agent RL training via a central controller and model-local workers.

What carries the argument

Four first-class objects (logical agent roles, graph trajectories, user-defined rewards, agent-model mappings) together with a Ray-based star-topology runtime that separates workflow control from rollout and distributed updates.
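To make the four objects concrete, here is a minimal Python sketch of how they might be represented. Every class and function name below (`AgentRole`, `Step`, `GraphTrajectory`, `RewardFn`, `AgentModelMapping`) is hypothetical and chosen for illustration; none is taken from UnityMAS-O or verl, whose actual interfaces are not shown on this page.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# All names here are hypothetical illustrations, not UnityMAS-O or verl APIs.

@dataclass(frozen=True)
class AgentRole:
    """A logical agent role: a name plus the prompt that defines its job."""
    name: str
    system_prompt: str

@dataclass
class Step:
    """One node of a graph trajectory: which role acted and what it produced."""
    step_id: int
    role: str
    prompt: str
    response: str
    parents: Tuple[int, ...] = ()  # earlier steps whose outputs fed this one

@dataclass
class GraphTrajectory:
    """One workflow execution, recorded as a DAG of steps."""
    steps: List[Step] = field(default_factory=list)

    def add(self, role: str, prompt: str, response: str,
            parents: Tuple[int, ...] = ()) -> int:
        step_id = len(self.steps)
        if any(p >= step_id for p in parents):
            raise ValueError("parents must refer to earlier steps")
        self.steps.append(Step(step_id, role, prompt, response, parents))
        return step_id

# A user-defined reward maps a finished trajectory to one scalar per step.
RewardFn = Callable[[GraphTrajectory], List[float]]

# An agent-model mapping sends each logical role to a physical model id.
AgentModelMapping = Dict[str, str]

roles = [AgentRole("planner", "Break the question into sub-queries."),
         AgentRole("answerer", "Answer using the planner's output.")]
mapping: AgentModelMapping = {"planner": "model_a", "answerer": "model_a"}

traj = GraphTrajectory()
s0 = traj.add("planner", "What is 2 + 2?", "compute 2 + 2")
s1 = traj.add("answerer", "compute 2 + 2", "4", parents=(s0,))

def exact_match_reward(t: GraphTrajectory) -> List[float]:
    """Toy reward: every step gets 1.0 if the final answer is '4', else 0.0."""
    hit = 1.0 if t.steps[-1].response == "4" else 0.0
    return [hit for _ in t.steps]

print(exact_match_reward(traj))  # [1.0, 1.0]
```

Here both roles map to the same model id, which corresponds to the full-sharing case; pointing the roles at different ids would give separation without touching the role or trajectory definitions.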

Load-bearing premise

That the four first-class objects and the Ray-based runtime can be implemented and used by practitioners without rewriting core optimization infrastructure.

What would settle it

A user defines a new multi-agent workflow (for example, a negotiation task) using only the four objects and the provided runtime, then checks whether the optimization loop runs without any changes to the training or buffering code.
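The shape of that test can be sketched in a few lines of Python. This is a toy stand-in under stated assumptions: `run_workflow` and `per_role_reward` play the part of the fixed infrastructure, the policies are deterministic functions rather than LLM calls, and every name is hypothetical rather than drawn from the framework. The point is only that the negotiation workflow is defined entirely by its roles, router, and reward, while the two generic functions stay untouched.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[str, str, str]                 # (role, input message, output)
Policy = Callable[[str], str]                 # stand-in for an LLM call
Router = Callable[[str, str], Optional[str]]  # (role, output) -> next role or None

# --- Fixed infrastructure (hypothetical): never edited when workflows change ---
def run_workflow(start_role: str, first_msg: str, router: Router,
                 policies: Dict[str, Policy], max_turns: int = 8) -> List[Record]:
    records: List[Record] = []
    role, msg = start_role, first_msg
    for _ in range(max_turns):
        out = policies[role](msg)
        records.append((role, msg, out))
        next_role = router(role, out)
        if next_role is None:
            break
        role, msg = next_role, out
    return records

def per_role_reward(records: List[Record],
                    reward_fn: Callable[[List[Record]], List[float]]) -> Dict[str, float]:
    """Placeholder for the training side: sums the reward credited to each role."""
    totals: Dict[str, float] = {}
    for (role, _, _), r in zip(records, reward_fn(records)):
        totals[role] = totals.get(role, 0.0) + r
    return totals

# --- New workflow: a two-party price negotiation, defined only by the user ---
def buyer(msg: str) -> str:
    price = int(msg.split()[1])
    return f"accept {price}" if price <= 80 else f"bid {price - 40}"

def seller(msg: str) -> str:
    price = int(msg.split()[1])
    return f"ask {price + 20}"

def negotiation_router(role: str, out: str) -> Optional[str]:
    if out.startswith("accept"):
        return None  # deal reached, stop the episode
    return "seller" if role == "buyer" else "buyer"

def negotiation_reward(records: List[Record]) -> List[float]:
    last = records[-1][2]
    if not last.startswith("accept"):
        return [0.0] * len(records)
    price = int(last.split()[1])
    share = {"buyer": (100 - price) / 100, "seller": price / 100}
    return [share[role] for role, _, _ in records]

records = run_workflow("buyer", "ask 100", negotiation_router,
                       {"buyer": buyer, "seller": seller})
totals = per_role_reward(records, negotiation_reward)
print([r[2] for r in records])  # ['bid 60', 'ask 80', 'accept 80']
```

If adding a workflow like this required edits inside `run_workflow` or `per_role_reward`, the analogous claim about the real framework would be in doubt.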


If this is right

Multi-agent RL improves performance over the original manually specified workflows on Natural Questions, HotpotQA, and code tasks.
Smaller models receive especially large gains, including on strict all-passed code metrics.
Rewards and credit can be assigned independently at role, turn, or full-trajectory levels.
The same infrastructure supports full sharing, full separation, or partial sharing of model parameters across logical agents.
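As an illustration of rewards at the three levels, the sketch below folds a role-level, a turn-level, and a trajectory-level signal into one scalar per step. The weighted additive rule and the function name `combine_rewards` are assumptions made for this example; the page states only that rewards can be assigned at each level, not how they are combined.

```python
from typing import Dict, List, Tuple

def combine_rewards(step_roles: List[str],
                    role_reward: Dict[str, float],
                    turn_reward: Dict[int, float],
                    trajectory_reward: float,
                    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Return one scalar per step (hypothetical additive combination).

    role_reward:       bonus credited to every step taken by a given role
    turn_reward:       bonus credited to one specific step index
    trajectory_reward: outcome reward shared by all steps of the episode
    """
    w_role, w_turn, w_traj = weights
    combined = []
    for i, role in enumerate(step_roles):
        r = (w_role * role_reward.get(role, 0.0)
             + w_turn * turn_reward.get(i, 0.0)
             + w_traj * trajectory_reward)
        combined.append(r)
    return combined

step_roles = ["retriever", "reader", "retriever", "reader"]
rewards = combine_rewards(
    step_roles,
    role_reward={"retriever": 0.5},  # e.g. retrieved passages were relevant
    turn_reward={3: 1.0},            # e.g. the final turn was well-formatted
    trajectory_reward=2.0,           # e.g. the final answer was correct
)
print(rewards)  # [2.5, 2.0, 2.5, 3.0]
```

Setting a weight to zero switches that level off, which is one simple way to compare outcome-only credit against role-aware credit on the same trajectories.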

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same abstractions could let teams swap in different agent roles or reward functions on existing workflows with minimal additional engineering.
Because the runtime already records structured trajectories, the framework could support credit-assignment methods beyond PPO without changing the controller.
The decoupling of logical roles from physical models may make it easier to test heterogeneous model sizes within one multi-agent system.
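The last point, decoupling logical roles from physical models, can be illustrated with a small sketch. The mapping below is a plain dictionary from role name to model id; the ids (`policy-7b`, `policy-3b`) and both helper functions are hypothetical and exist only to show how one mapping determines the sharing regime and which model's training batch each step lands in.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def sharing_mode(mapping: Dict[str, str]) -> str:
    """Classify a role -> model mapping by how parameters are shared."""
    n_roles = len(mapping)
    n_models = len(set(mapping.values()))
    if n_models == n_roles:
        return "full separation"   # every role has its own model
    if n_models == 1:
        return "full sharing"      # all roles use one model
    return "partial sharing"       # some roles share, some do not

def group_by_model(steps: List[Tuple[str, str]],
                   mapping: Dict[str, str]) -> Dict[str, List[Tuple[str, str]]]:
    """Route each (role, text) step to the batch of the model that produced it."""
    batches: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for role, text in steps:
        batches[mapping[role]].append((role, text))
    return dict(batches)

# Heterogeneous sizes: a larger planner, a smaller model shared by two roles.
mapping = {"planner": "policy-7b", "searcher": "policy-3b", "answerer": "policy-3b"}
steps = [("planner", "plan"), ("searcher", "query"), ("answerer", "final answer")]

print(sharing_mode(mapping))                      # partial sharing
print(sorted(group_by_model(steps, mapping)))     # ['policy-3b', 'policy-7b']
```

Changing only the dictionary values moves the same workflow between the three regimes, which is what would make side-by-side comparisons of model sizes cheap to set up.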

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

UnityMAS-O is a practical verl extension adding multi-agent abstractions and a Ray runtime so workflows can be treated as RL units, but the abstract supplies zero numbers or baselines to judge the claimed gains.


Colleague,

The main point is that UnityMAS-O extends verl with four first-class objects—logical roles, graph trajectories, user-defined rewards, and agent-model mappings—plus a central Ray controller for workflow execution and distributed PPO updates. This lets the entire multi-agent interaction be the optimization target instead of single policies, and users can add new roles or rewards without touching the training loop.

The design is straightforward and addresses a clear gap: most current RL post-training tools are single-agent only. The three instantiations on retrieval QA, iterative search, and reflective code generation show how the abstractions map to real tasks, and the claim of larger gains for smaller models and strict code metrics is at least directionally plausible.

The obvious weakness is that the abstract asserts measurable improvement on Natural Questions, HotpotQA, and held-out code tasks but gives no numbers, baselines, variance, or even basic experimental settings. Without those, it is impossible to tell whether the framework actually delivers usable gains or whether the runtime overhead is manageable. The soundness therefore rests entirely on the full paper’s experiments, which are not visible here.

This work is aimed at people already running verl or similar libraries who want to move multi-agent LLM pipelines from prompt engineering to end-to-end RL. A practitioner who needs the exact abstractions described would get immediate value from the code structure; a theorist looking for new algorithms would not.

It deserves peer review. The systems contribution is concrete and the gap it fills is real, even if the current evidence is thin. A referee can check whether the reported improvements hold up and whether the user extensibility claim survives actual use.

Referee Report

1 major / 0 minor

Summary. The paper presents UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. It models complete workflows as the optimization unit using four first-class objects (logical agent roles, graph trajectories, user-defined rewards, and agent-model mappings) that decouple logical agents from physical model parameters and support flexible credit assignment and parameter sharing. The framework extends verl with a Ray-based star-topology runtime separating a central controller from model-local workers. Three instantiations are described (retrieval-augmented QA, iterative agentic search, reflective code generation) with the claim that multi-agent RL yields improvements over manually specified workflows on Natural Questions, HotpotQA, and held-out code tasks, especially for smaller models and strict code metrics, positioning UnityMAS-O as a reusable substrate for trainable multi-agent RL systems.

Significance. If the experimental results hold with proper validation, the work would be significant for providing the first unified RL interface that treats multi-agent workflows as first-class optimization units rather than single-policy trajectories. The abstractions for role-specific rewards, graph trajectories, and configurable model sharing address a clear gap in existing single-agent RL post-training frameworks and could enable broader adoption of multi-agent RL for LLM systems.

major comments (1)

[Abstract] The central claim that 'multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics' is asserted without any quantitative results, baselines, error bars, statistical tests, or experimental details. This directly undermines evaluation of whether the three instantiations support the reusable-substrate conclusion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address the major comment on the abstract below and will revise the manuscript to strengthen the presentation of results.


Referee: [Abstract] The central claim that 'multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics' is asserted without any quantitative results, baselines, error bars, statistical tests, or experimental details. This directly undermines evaluation of whether the three instantiations support the reusable-substrate conclusion.

Authors: We agree that the abstract as written makes a qualitative claim without supporting quantitative details, which limits the ability to evaluate the strength of the evidence. The full paper contains experimental results across the three instantiations (retrieval-augmented QA, iterative agentic search, and reflective code generation) on Natural Questions, HotpotQA, and held-out code tasks, including comparisons to manually specified workflows. In the revised version, we will update the abstract to include specific quantitative highlights (e.g., relative improvements, particularly for smaller models and strict metrics) drawn from the results section, while keeping the abstract concise. We will also ensure the abstract references the experimental setup at a high level.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The paper describes a systems framework for RL optimization of LLM-based multi-agent workflows using four first-class objects (roles, trajectories, rewards, mappings) and a Ray-based runtime extension to verl. No mathematical derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on empirical results from three instantiations on external benchmarks (Natural Questions, HotpotQA, code tasks), with no self-citation chains or ansatzes invoked as load-bearing. The contribution is self-contained as an extensible substrate without circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software framework paper; the abstract introduces no mathematical free parameters, axioms, or invented scientific entities.

pith-pipeline@v0.9.1-grok · 5893 in / 1079 out tokens · 39579 ms · 2026-06-29T18:14:32.904464+00:00 · methodology


Original abstract

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.
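The controller/worker split described in the abstract can be sketched in plain Python. This is a single-process stand-in, not Ray code: in the described runtime the workers would presumably be distributed worker groups, and the advantage here is a simple reward-minus-mean baseline used in place of whatever PPO-style estimator the framework actually applies. `ModelWorker`, `Controller`, and `make_reward` are hypothetical names.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str, str]  # (role, prompt, response)

class ModelWorker:
    """Model-local worker: handles rollout, buffering, and advantages for one model."""
    def __init__(self, model_id: str, generate: Callable[[str], str]):
        self.model_id = model_id
        self.generate = generate  # stand-in for an LLM forward pass
        self.buffer: List[Tuple[str, str, float]] = []

    def rollout(self, prompt: str) -> str:
        return self.generate(prompt)

    def store(self, prompt: str, response: str, reward: float) -> None:
        self.buffer.append((prompt, response, reward))

    def advantages(self) -> List[float]:
        """Simplified advantage: reward minus the buffer mean (an assumption)."""
        if not self.buffer:
            return []
        baseline = mean(r for _, _, r in self.buffer)
        return [r - baseline for _, _, r in self.buffer]

class Controller:
    """Hub of the star: runs the workflow and talks to every worker.
    Workers never talk to each other."""
    def __init__(self, workers: Dict[str, ModelWorker], mapping: Dict[str, str]):
        self.workers = workers
        self.mapping = mapping  # logical role -> model id

    def run(self, workflow: List[str], question: str,
            reward_fn: Callable[[List[Record]], List[float]]) -> List[Record]:
        records: List[Record] = []
        msg = question
        for role in workflow:
            prompt = f"[{role}] {msg}"
            out = self.workers[self.mapping[role]].rollout(prompt)
            records.append((role, prompt, out))
            msg = out
        # Rewards are assembled centrally, then pushed to the owning worker.
        for (role, prompt, out), r in zip(records, reward_fn(records)):
            self.workers[self.mapping[role]].store(prompt, out, r)
        return records

def make_reward(gold: str) -> Callable[[List[Record]], List[float]]:
    """Toy outcome reward: 1.0 for every step if the final response equals gold."""
    return lambda recs: [1.0 if recs[-1][2] == gold else 0.0] * len(recs)

workers = {"a": ModelWorker("a", lambda p: "plan: " + p),
           "b": ModelWorker("b", lambda p: "42")}
controller = Controller(workers, {"planner": "a", "answerer": "b"})
controller.run(["planner", "answerer"], "q1", make_reward("42"))  # reward 1.0
controller.run(["planner", "answerer"], "q2", make_reward("7"))   # reward 0.0
print(workers["a"].advantages())  # [0.5, -0.5]
```

Because the controller is the only component that knows the workflow, swapping the list of roles or the reward function leaves `ModelWorker` untouched, which mirrors the separation the abstract describes.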

Figures

Figures reproduced from arXiv: 2605.26646 by Bin Zhang, Biqing Qi, Erhan Zhang, Haitao Li, Jiaxin Mao, Jinyuan Feng, Lingyong Yan, Qi Liu, Rui Li, Shijie Wang, Wei Yang, Xiaochi Wei, Yan Gao, Yao Hu, Yiqun Chen, Yi Wu, Zechun Niu.

Figure 2: Distributed training architecture of UnityMAS-O. A central controller executes ...

Figure 3: Workflow templates used in the experiments. The figure shows three search ...

Figure 4: Effect of multi-agent RL training on QA validation F1 across datasets, workflows, ...

Figure 5: HotpotQA M-ASK validation F1 for a 3B shared-parameter setting and a 4x3B ...

Figure 6: Training dynamics of the Reflective Verification Loop code workflow up to step ...

Figure 7: Average number of verification turns used by the Reflective Verification Loop ...


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MIRROR: Learning from the Other View for Multi-Modal Reasoning
cs.AI 2026-07 conditional novelty 6.0

An RL method that selects the best-performing view of each geometry problem as an internal teacher and distills it into weaker views improves VLM reasoning accuracy and consistency.

SearchArt: Training Long-Horizon Search Agent with Scalable Synthetic and Verified Task
cs.IR 2026-07 conditional novelty 5.0

SearchArt post-trains Qwen3.5-27B on verification-filtered synthetic search trajectories, scoring 74.39 on BrowseComp-ZH, 70.06 on BrowseComp, and 52.55 on DeepResearch-Bench, competitive with several 200B-700B agents.
