
arxiv: 2512.04388 · v5 · submitted 2025-12-04 · 💻 cs.LG

Learning to Orchestrate Agents in Natural Language with the Conductor

Pith reviewed 2026-05-17 01:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords conductor model · LLM coordination · reinforcement learning · multi-agent systems · reasoning benchmarks · prompt engineering · communication topology

The pith

A 7B Conductor model trained with reinforcement learning can orchestrate multiple LLMs to outperform any single model on reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Conductor model that learns through reinforcement learning to coordinate groups of language models. It figures out how they should communicate and what specific instructions to give each one. This allows even a relatively small 7 billion parameter Conductor to guide stronger worker models toward better answers on hard problems like coding and science questions. The approach works with any mix of open and closed source models and can even include itself in the team for iterative improvement.

Core claim

By training a Conductor with reinforcement learning to automatically discover coordination strategies, including communication topologies and focused prompts for worker LLMs, a 7B model achieves performance gains beyond any individual worker and attains state-of-the-art results on benchmarks such as LiveCodeBench and GPQA. Training with randomized agent pools enables adaptation to arbitrary sets of agents, and allowing self-selection creates recursive topologies for dynamic test-time scaling.
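
As a rough illustration of what training with randomized agent pools could involve, the sketch below draws a random subset of workers for each training question and renumbers them from 0. The worker names, the pool-size bounds, and the function name `sample_agent_pool` are all hypothetical; the paper's actual sampling procedure is not described on this page.

```python
import random

# Hypothetical worker names, for illustration only.
ALL_WORKERS = ["worker-a", "worker-b", "worker-c", "worker-d", "worker-e"]

def sample_agent_pool(workers, rng, min_size=2, max_size=None):
    """Return a random subset of workers as {index: name}, in random order."""
    max_size = len(workers) if max_size is None else max_size
    size = rng.randint(min_size, max_size)   # inclusive on both ends
    pool = rng.sample(workers, size)         # sampling without replacement
    return dict(enumerate(pool))             # renumber so ids start at 0

rng = random.Random(0)                       # seeded for repeatability
pool = sample_agent_pool(ALL_WORKERS, rng)
print(pool)
```

Renumbering the sampled workers means the orchestrator cannot memorize a fixed id for a favourite model, which is one plausible reason such randomization would help it adapt to unseen pools.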

What carries the argument

The Conductor, a model trained via reinforcement learning to design agent communication topologies and engineer prompts for optimal collaboration among worker LLMs.

If this is right

  • Significant performance gains on challenging reasoning benchmarks beyond single worker models.
  • State-of-the-art results on LiveCodeBench and GPQA.
  • Adaptation to arbitrary pools of open- and closed-source agents.
  • Support for recursive topologies through self-selection, enabling online iterative adaptation.
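
The recursive behaviour in the last point can be pictured as a loop in which the Conductor sees the response produced by its previous strategy and either iterates or stops. This is only a sketch under assumptions: `propose` and `execute` are hypothetical callables standing in for the Conductor and for running a strategy over the workers, `max_rounds` is an invented cap, and returning `None` to mean "keep the current response" is an illustrative convention rather than the paper's output format.

```python
from typing import Callable, Optional

def recursive_orchestrate(
    question: str,
    propose: Callable[[str, Optional[str]], Optional[dict]],
    execute: Callable[[str, dict], str],
    max_rounds: int = 3,
) -> Optional[str]:
    """Alternate between designing a strategy and running it."""
    response: Optional[str] = None
    for _ in range(max_rounds):
        # The orchestrator sees the question and the latest response
        # (None on the first round).
        strategy = propose(question, response)
        if strategy is None:
            break  # hypothetical signal: the current response is kept
        response = execute(question, strategy)
    return response

# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_propose(question, response):
    if response is None:
        return {"subtasks": ["solve"]}
    if "verified" not in response:
        return {"subtasks": ["verify"]}
    return None

def toy_execute(question, strategy):
    return "draft" if strategy["subtasks"] == ["solve"] else "draft, verified"

result = recursive_orchestrate("toy question", toy_propose, toy_execute)
print(result)
```

Capping the number of rounds keeps test-time compute bounded even if the orchestrator never decides to stop.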

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Coordination learned this way might scale better than increasing the size of individual models alone.
  • Users could dynamically choose worker sets based on cost or capability without retraining the orchestrator.
  • Recursive self-inclusion suggests a path to test-time compute scaling without fixed architectures.

Load-bearing premise

The performance improvements come primarily from the learned coordination and prompting strategies rather than from the particular choice of worker models or other unmentioned factors.

What would settle it

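Measure performance when the Conductor's learned strategies are replaced by random or fixed coordination strategies over the same worker pool. If the gains disappear, the claim holds; if they persist, the improvement comes from the workers rather than from learned coordination.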

Figures

Figures reproduced from arXiv: 2512.04388 by Edoardo Cetin, Jinglue Xu, Peter Schwendeman, Qi Sun, Stefan Nielsen, Yujin Tang.

Figure 1
Figure 1: Our Conductor attains the state-of-the-art in GPQA and LiveCodeBench. Through unprecedented scale and engineering effort, modern Large Language Models (LLMs) (Anthropic, 2025; OpenAI, 2025; 2023; Team et al., 2023) demonstrate the ability to solve formidably complex tasks, with performance even approaching that of top human experts (Luong & Lockhart, 2025). These remarkable latent capabilities are essent…
Figure 2
Figure 2: The Conductor output. The Conductor responds with the entire coordination strategy. … for each question $q \in D$. Then, for $\beta \ge 0$ and a KL-divergence penalty to the reference model $D_{\mathrm{KL}}(\cdot \,\|\, \pi_{\mathrm{ref}})$, the optimization objective is given by the KL-discounted policy maximization: $$J(\theta) = \mathbb{E}_{q \sim D,\ \{o_i\}_{i=1}^{G} \sim \pi_\theta(\cdot \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \Big( \min\big( r_i A_i,\ \operatorname{clip}(r_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i \big) - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) \Big) \right], \tag{1}$$ using the grouped completio…
Figure 3
Figure 3: Emergence of powerful coordination strategies over training. Early in training, the Conductor issues sound subtasks, but does not tap useful collaborative strategies such as verification (bottom-right). Near convergence, the Conductor has learned to utilize planners, issue targeted instructions, instruct workers to share reasoning, and leverage verification and refinement (top-right), leading to the Conduc…
Figure 4
Figure 4: Conductor in-distribution evaluation against multi-agent methods and 5-turn reflection agent baselines. The Conductor surpasses all baselines by substantive margins, exemplifying the Conductor’s ability to amplify the capabilities of its workers. Numerical results in …
Figure 5
Figure 5: Performance vs Efficiency. The Conductor far surpasses multi-agent baselines at a fraction of the cost. Scores are task-averages from …
Figure 6
Figure 6: Finetuned on randomized model pools, the Conductor achieves strong performance over rarely used open-model subsets while maintaining performance on the closed-model subsets. Dynamic worker pool. We evaluate our Conductor finetuned on randomized model subsets and compare it with its pre-trained counterpart, which was always given full access to all models in our original set. In particular, we focus on t…
Figure 7
Figure 7: Conductor Scale. The 3B Conductor still learns optimal agent selection, as shown by the agent distribution converging on the three most powerful models (left). However, when scaling to 7B, the Conductor generates additional performance gains, even for identical agent selection, through its improved prompt engineering (right). Evaluation performance taken from …
Figure 8
Figure 8: Task adaptivity. In more straightforward tasks, such as MMLU, the Conductor learns that 2 agents working together is optimal. In more complex settings, such as LiveCodeBench, the Conductor allocates more compute by devising coordination strategies with 3 or even 4 agents. … converge to select the same distribution of worker agents. However, while both of our models still display performance well beyond all o…
Figure 9
Figure 9: OOD few-shot examples improve Conductor performance. Subtasks. We present results of the ablation in …
Figure 10
Figure 10: Recursive Conductor worker distribution on BigCodeBench. The Conductor redistributes its agent selection towards Claude and Gemini in recursive rounds, reflecting their superior performance. We present in …
Figure 11
Figure 11: Conductor schematic visualization. The Conductor combines the differing specializations of the worker LLMs to answer complex user queries. Here we visualize a workflow bridging both mathematical reasoning and English-Chinese translation.
Figure 12
Figure 12: Recursive Conductor visualization. At test time, the Conductor is able to adapt its initial coordination strategies on-the-fly. We show in …
Figure 13
Figure 13: The Conductor prompt. Our Conductor prompt instructs the Conductor with the required format for its output to be parseable as a complete coordination strategy. Your role as an assistant involves obtaining answers to questions by an iterative process of querying powerful language models, each with a different skillset. You are given a user-provided question and a list of available numbered language models …
Figure 14
Figure 14: The recursive prompt. When allowing self-referential recursion, the Conductor views the response obtained from its previously designed coordination strategy and decides whether to iterate on its strategy or pass the existing response back to the user. Here is the final response obtained at the end of your routing steps: {worker response} You now have a chance to correct or improve this response by outputt…
Figure 15
Figure 15: Example 3B Conductor completion. The 3B model provides a workable strategy, but suboptimally instructs the first model to hide their reasoning due to the constraint provided by the user, impairing collaboration. User: Solve the following math problem step by step: Let $F(z) = \frac{z+i}{z-i}$ for all complex numbers $z \neq i$, and let $z_n = F(z_{n-1})$ for all positive integers $n$. Given that $z_0 = \frac{1}{137} + i$, find $z_{2002}$. For …
Figure 16
Figure 16: In-distribution few-shot examples. The two few-shot examples provided are taken from MATH500 and MMLU. EXAMPLE 1: Question: Subject: Physics. A converging lens is placed 30.0 cm to the right of a diverging lens of focal length 10.0 cm. A beam of parallel light enters the diverging lens from the left, and the beam is again parallel when it emerges from the converging lens. Calculate the focal length of the…
Figure 17
Figure 17: Out-of-distribution few-shot examples. The two few-shot examples provided are taken from deepmath and medreason, which are out-of-distribution relative to the training tasks of MATH500, MMLU, RLPR, and LiveCodeBench. EXAMPLE 1: Question: Does brain-derived neurotrophic factor enhance the contraction of intestinal muscle strips induced by SP and CGRP in mice? Answer Choices: A. Yes B. No Assistant Respons…
Figure 18
Figure 18: Example Conductor completion with a 1-shot strategy. In the following example, the Conductor recognizes the simplicity of the question and directly solves it with a single model. User: A heartbroken dog whose owner died two months ago is missing her so much that he attends services every day at the Italian church where her funeral was held, patiently waiting for her to return. Tommy, a seven-year-old dog, …
Figure 19
Figure 19: Example Conductor completion with a tree topology. The Conductor solves the problem with two independent steps followed by an aggregation step, specified by its access list of [[], [], ["all"]]. User: One ticket to a show costs $20 at full price. Susan buys 4 tickets using a coupon that gives her a 25% discount. Pam buys 5 tickets using a coupon that gives her a 30% discount. How many more dollars does Pa…
Figure 20
Figure 20: Example Conductor completion allocating more agents to solve a harder problem. In the following LiveCodeBench example, the Conductor devises an extensive workflow utilizing all 5 possible workflow steps. User: You are given a weighted simple connected undirected graph with N vertices and M edges, where vertices are numbered 1 to N, and edges are numbered 1 to M. Additionally, a positive integer K is given.…
Figure 21
Figure 21: Example Conductor-recursive completion, determining that the original coordination strategy was already sound. User: You are given a string S of length N consisting of lowercase English letters. If there are any adjacent occurrences of a and b in S, print Yes; otherwise, print No. (The order of a and b does not matter.) Conductor: 1. First, we’ll directly attempt to solve the problem. 2. Next, we’ll check …
Figure 22
Figure 22: Example Conductor-recursive completion, deciding to allocate additional agents to provide feedback and verify in the recursion round. User: You are given a string S of length N consisting of lowercase English letters. If there are any adjacent occurrences of a and b in S, print Yes; otherwise, print No. (The order of a and b does not matter.) Conductor: 1. First, we’ll directly attempt to solve the proble…
Figure 23
Figure 23: Utilizing a planner and coder in LiveCodeBench. In this LiveCodeBench problem, the Conductor leverages Gemini 2.5 Pro to first act as a strategist and then GPT to act as a coder. User: There are N cities in a certain country. You will travel from your office in city 1 to a destination in city N, via zero or more cities. Two types of transportation are available: company car and train. The time required to …
Figure 24
Figure 24: Comprehending, planning, strategizing, and finally coding with 3 agents on LiveCodeBench. In this LiveCodeBench problem, the Conductor leverages Claude Sonnet 4 to first understand the objective and constraints, then Gemini 2.5 Pro strategizes and plans, before GPT-5 implements the final code. User: You are given a 0-indexed integer array nums representing the initial positions of some marbles. You are al…
Figure 25
Figure 25: Understanding the method, analyzing the data, then solving in MMLU. In this MMLU problem, the Conductor leverages Gemini 2.5 Pro and Claude Sonnet 4 to analyze the data and methods, before handing over to GPT-5 to aggregate their work into the final response. User: Flying over a desert in an airplane, two scientists looked down with trained eyes at trees and bushes. After an hour’s flight one of the scient…
Figure 26
Figure 26: Factual recall makes use of tree topology. In this Medreason problem, the Conductor explicitly recognizes that agent-to-agent collaboration is unnecessary, and relies instead on querying models independently and aggregating to solve the answer. User: Does brain derived neurotrophic factor enhance the contraction of intestinal muscle strips induced by SP and CGRP in mice? Conductor: The user question is abou…
Figure 27
Figure 27: Small models can help format check. In this BigCodeBench problem, we noticed GPT-5 making frequent formatting errors, providing additional information and hyperparameter choices in the dataframe title. Employing Qwen-32B as a final format checker helped solve this problem and raise GPT-5’s performance. User: Create a Pandas DataFrame from a list of pairs and visualize the data using a bar chart. - The title…
Figure 28
Figure 28: Conductor role abdication. The Conductor abdicates its role as a subtask planner to Gemini 2.5 Pro, tasking it with devising subtasks for the other models to enact. User: You are given an array nums consisting of positive integers. We call a subarray of an array complete if the following condition is satisfied: The number of distinct elements in the subarray is equal to the number of distinct elements in t…

Original abstract

Powerful large language models (LLMs) from different providers have been expensively trained and finetuned to specialize across varying domains. In this work, we introduce a new kind of Conductor model trained with reinforcement learning to automatically discover powerful coordination strategies among LLMs. Our Conductor learns not only to design targeted communication topologies for effective agent-to-agent collaboration, but also to prompt engineer focused instructions to the LLMs to maximally leverage their individual capabilities. We show that, by learning optimal coordination strategies over pools of powerful worker LLMs, a 7B Conductor achieves significant performance gains beyond any individual worker, attaining state-of-the-art results in challenging reasoning benchmarks, such as LiveCodeBench and GPQA. By training with randomized agent pools, our conductor effectively adapts to arbitrary sets of open- and closed-source agents, meeting any user requirements. Furthermore, allowing the Conductor to select itself as a worker gives rise to recursive topologies, elevating performance with a new form of dynamic test-time scaling through online iterative adaptation. More broadly, ours is among the early work demonstrating language model coordination can be unlocked through RL, where powerful coordination strategies emerge naturally in LLMs through pure end-to-end reward maximization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a Conductor model trained with reinforcement learning to discover coordination strategies among pools of worker LLMs. The Conductor learns communication topologies and targeted prompts to orchestrate agents, with a 7B model claimed to outperform any individual worker and reach state-of-the-art results on LiveCodeBench and GPQA. Additional claims include adaptation to arbitrary open- and closed-source agent sets via randomized pools during training and performance gains from recursive topologies enabled by self-selection as a worker.

Significance. If the central performance claims hold after proper controls, the work would be significant for multi-agent LLM systems. It provides an early demonstration that end-to-end RL on coordination can yield emergent strategies that leverage existing specialized models, potentially offering a path to dynamic test-time scaling without retraining individual workers.

major comments (2)
  1. [Experimental Evaluation] The central claim that learned coordination strategies produce gains beyond any individual worker (and SOTA results) is load-bearing, yet the evaluation lacks matched baselines that apply the identical worker pool under non-learned strategies such as fixed round-robin, static prompt templates, or best-single-model oracle selection. Without these controls, gains cannot be isolated from the choice of worker models or the reward function that scores final-answer correctness.
  2. [Method] The description of the RL training process and reward function requires explicit detail on how final-answer correctness is scored and whether the number of workers queried per step or recursion self-selection rules are fixed or learned; these choices directly affect whether the reported improvements can be attributed to discovered topologies rather than implementation specifics.
minor comments (2)
  1. [Abstract] The abstract states that pools are randomized at training time but does not clarify whether test-time evaluations maintain the same randomization protocol or use fixed pools; adding this distinction would improve reproducibility.
  2. [Method] Notation for communication topologies and prompt-engineering actions could be formalized with a short diagram or pseudocode to clarify how the Conductor outputs are translated into agent interactions.
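
One way to picture how a Conductor output could be turned into agent interactions is sketched below. It assumes the three-list format visible in the paper's example completions (a list of model ids, a list of subtasks, and an access list such as the [[], [], ["all"]] quoted in the Figure 19 caption). The `Worker` interface, the prompt layout, and the reading of integer access entries as step positions are assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, Dict, List, Union

Worker = Callable[[str], str]  # hypothetical interface: prompt in, text out

def run_strategy(
    question: str,
    model_id: List[int],
    subtasks: List[str],
    access_list: List[List[Union[int, str]]],
    workers: Dict[int, Worker],
) -> List[str]:
    """Run each step in order and return every step's output."""
    outputs: List[str] = []
    for step, (mid, task, access) in enumerate(zip(model_id, subtasks, access_list)):
        if "all" in access:
            visible = list(range(step))  # every earlier step is shared
        else:
            # Assumed convention: integers name earlier steps by position.
            visible = [i for i in access if isinstance(i, int) and 0 <= i < step]
        prompt = f"Question: {question}\nSubtask: {task}"
        if visible:
            shared = "\n".join(f"[step {i}] {outputs[i]}" for i in visible)
            prompt += f"\nPrevious work:\n{shared}"
        outputs.append(workers[mid](prompt))
    return outputs

# Stub workers (hypothetical) that record the prompt they receive.
prompts_seen: List[str] = []

def make_stub(name: str) -> Worker:
    def worker(prompt: str) -> str:
        prompts_seen.append(prompt)
        return f"{name}-answer"
    return worker

workers = {1: make_stub("m1"), 2: make_stub("m2"), 3: make_stub("m3")}
outputs = run_strategy(
    "toy question",
    model_id=[1, 2, 3],
    subtasks=["first independent step", "second independent step", "aggregate"],
    access_list=[[], [], ["all"]],
    workers=workers,
)
print(outputs[-1])
```

With the access list [[], [], ["all"]], the first two workers see only the question and their subtask, while the third sees both earlier outputs, which gives the tree shape described for Figure 19.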

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to strengthen the evaluation and clarify the methodology.

Point-by-point responses
  1. Referee: [Experimental Evaluation] The central claim that learned coordination strategies produce gains beyond any individual worker (and SOTA results) is load-bearing, yet the evaluation lacks matched baselines that apply the identical worker pool under non-learned strategies such as fixed round-robin, static prompt templates, or best-single-model oracle selection. Without these controls, gains cannot be isolated from the choice of worker models or the reward function that scores final-answer correctness.

    Authors: We agree that matched baselines using the identical worker pool are necessary to isolate the effect of learned coordination. In the revised manuscript we have added experiments that apply fixed round-robin scheduling and static prompt templates to the same randomized pools. We also include an explicit best-single-model oracle baseline. These new results are reported in an expanded experimental section and confirm that the performance gains arise from the discovered topologies and prompts rather than from worker choice or reward design alone. revision: yes

  2. Referee: [Method] The description of the RL training process and reward function requires explicit detail on how final-answer correctness is scored and whether the number of workers queried per step or recursion self-selection rules are fixed or learned; these choices directly affect whether the reported improvements can be attributed to discovered topologies rather than implementation specifics.

    Authors: We have expanded the Methods section and added an appendix with pseudocode. Final-answer correctness is scored with task-specific metrics (exact match on GPQA, pass@1 on LiveCodeBench). The number of workers queried per step and the decision to include the Conductor itself for recursion are both outputs of the learned policy rather than fixed hyperparameters. These details are now stated explicitly so that the contribution of the discovered topologies can be clearly attributed. revision: yes
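
The non-learned controls named in the first referee comment could be set up along the following lines. This is a sketch under stated assumptions: the function names, the fixed prompt template, the chain-style access list, and the reading of "best-single-model oracle" as the worker with the highest mean score are all illustrative choices, and the binary exact-match reward is a simplification of whatever scoring the paper uses.

```python
from typing import Dict, List, Tuple

def round_robin_strategy(
    pool_ids: List[int],
    num_steps: int,
    template: str = "Solve the question, improving on any previous work.",
) -> Tuple[List[int], List[str], List[list]]:
    """Fixed baseline: cycle through the pool with one static prompt."""
    model_id = [pool_ids[i % len(pool_ids)] for i in range(num_steps)]
    subtasks = [template] * num_steps
    # Each step after the first sees all earlier outputs (assumed choice).
    access_list = [[] if i == 0 else ["all"] for i in range(num_steps)]
    return model_id, subtasks, access_list

def exact_match_reward(prediction: str, target: str) -> float:
    """Binary reward: 1.0 if the normalized strings agree, else 0.0."""
    return 1.0 if prediction.strip().lower() == target.strip().lower() else 0.0

def best_single_model(per_worker_rewards: Dict[str, List[float]]) -> Tuple[str, float]:
    """Pick the worker with the highest mean reward over the benchmark."""
    means = {w: sum(r) / len(r) for w, r in per_worker_rewards.items()}
    best = max(means, key=means.get)
    return best, means[best]

ids, tasks, access = round_robin_strategy([3, 1], num_steps=5)
print(ids)
```

Running these baselines over the same worker pool and scoring them with the same reward is what would let the learned strategies be compared like for like.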

Circularity Check

0 steps flagged

No circularity: empirical RL training of Conductor is self-contained

full rationale

The paper describes a standard reinforcement learning setup in which a 7B Conductor is trained end-to-end to maximize a reward based on final answer correctness while selecting communication topologies and prompts over randomized pools of worker LLMs. The reported gains on LiveCodeBench and GPQA are presented as empirical outcomes of this optimization, with additional features such as self-selection for recursion described as emergent behaviors. No equations, definitions, or self-citations are shown that reduce the central performance claim to a fitted parameter or to the input data by construction. The derivation therefore rests on observable training dynamics and benchmark evaluation rather than on any definitional equivalence or load-bearing self-reference.
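
For readers who want the optimization step in concrete terms, the sketch below evaluates a simplified form of the clipped, KL-penalized objective quoted in the Figure 2 excerpt. It assumes one probability ratio per completion rather than per token, and it assumes the group-normalized advantage commonly used with this kind of objective, since the excerpt is truncated before the advantage is defined. Function names and default values are hypothetical.

```python
import math
from typing import List

def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Assumed advantage: reward standardized within its group of G samples."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_objective(
    ratios: List[float],
    advantages: List[float],
    kl: float,
    clip_eps: float = 0.2,
    beta: float = 0.0,
) -> float:
    """Mean over the group of min(r*A, clip(r, 1-eps, 1+eps)*A) - beta*KL."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(r * a, clipped * a) - beta * kl
    return total / len(ratios)

# Toy group of four completions with binary correctness rewards.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
value = clipped_objective([1.5, 1.0, 1.0, 0.9], adv, kl=0.0)
print(adv, value)
```

The clip term limits how much a single update can exploit a completion whose probability has already moved far from the sampling policy, and the beta term controls how strongly the policy is held near the reference model.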

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review; no explicit free parameters, axioms, or invented entities are detailed beyond the introduction of the Conductor concept itself.

invented entities (1)
  • Conductor model · no independent evidence
    purpose: Learns coordination strategies and prompts for worker LLMs
    New model type introduced to orchestrate other LLMs via RL.

pith-pipeline@v0.9.0 · 5518 in / 1108 out tokens · 48922 ms · 2026-05-17T01:31:19.421625+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    LEMON trains an LLM orchestrator with counterfactual-augmented GRPO to produce deployable multi-agent specifications that reach state-of-the-art results on six reasoning and coding benchmarks.
