pith. machine review for the scientific record.

arxiv: 2604.02988 · v1 · submitted 2026-04-03 · 💻 cs.IR · cs.AI

Recognition: no theorem link

Self-Optimizing Multi-Agent Systems for Deep Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:10 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords multi-agent systems · prompt optimization · self-play · deep research · information retrieval · automated agents · query synthesis

The pith

Multi-agent systems self-optimize prompts through self-play to match or exceed expert performance in deep research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how multi-agent Deep Research systems can improve by optimizing their own prompts instead of depending on manual engineering. Agents use self-play to experiment with various prompt combinations while iteratively planning, retrieving, and synthesizing information from many documents. This method yields systems that perform at least as well as those built with expert prompts, addressing the brittleness, high cost, and slow iteration of current hand-engineered architectures. Readers might value this because it points toward more flexible and efficient AI tools for handling complex queries.

Core claim

By enabling agents in a multi-agent architecture to self-play and explore different prompt combinations, the system can generate high-quality Deep Research outputs that match or outperform those from expert-crafted prompts, addressing the limitations of static, hand-engineered designs.

What carries the argument

Self-play optimization of prompt combinations, where an orchestrator agent coordinates worker agents that test and refine prompts autonomously.
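
As a minimal sketch of what such a loop could look like, assume a greedy accept-if-better search over per-agent prompts. The optimizers the paper actually compares (GEPA and TextGrad, per Figure 2) explore more richly, and `mutate` and `evaluate` are hypothetical interfaces rather than the authors' API:

```python
import random

def optimize_prompts(prompts, mutate, evaluate, rounds=20, seed=0):
    """Greedy self-play loop: vary one agent's prompt per round, keep improvements.

    prompts:  dict mapping agent name (orchestrator, reader, ...) -> prompt string
    mutate:   callable(prompt) -> rewritten prompt (e.g. an LLM proposing a variant)
    evaluate: callable(prompts) -> mean judged report quality on training queries
    """
    rng = random.Random(seed)
    best, best_score = dict(prompts), evaluate(prompts)
    for _ in range(rounds):
        candidate = dict(best)
        agent = rng.choice(sorted(candidate))        # which agent's prompt to vary
        candidate[agent] = mutate(candidate[agent])  # propose a prompt variant
        score = evaluate(candidate)
        if score > best_score:                       # accept only improvements
            best, best_score = candidate, score
    return best
```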

If this is right

  • Reduces the need for time-consuming hand-engineering of prompts by experts.
  • Creates systems that can adapt more readily to new, complex information needs.
  • Lowers the overall cost and effort required to build effective deep research tools.
  • May lead to more robust performance across diverse document collections and queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar self-optimization techniques could apply to other agent-based tasks like multi-step planning or collaborative problem-solving.
  • Over time, such systems might develop emergent behaviors not anticipated in initial designs.
  • Combining this with larger language models could further enhance synthesis capabilities in research scenarios.

Load-bearing premise

That the performance improvements observed from self-play on specific tested tasks will hold for entirely new queries and document sets without the system overfitting to its training environment.

What would settle it

Running the self-optimized system on a fresh set of complex user queries over new document collections and comparing its outputs against those from expert-designed prompts: if it consistently fails to match or exceed their quality, the core claim falls.
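
A minimal harness for that test might look like the sketch below, where `run_pipeline` and `judge_pair` are hypothetical interfaces (the judge returning a signed preference, positive when the first report is better); nothing here is specified by the paper:

```python
def win_rate(optimized, expert, fresh_queries, run_pipeline, judge_pair):
    """Fraction of held-out queries where the self-optimized prompts produce a
    report judged at least as good as the expert-prompted system's report."""
    wins = 0
    for q in fresh_queries:
        report_opt = run_pipeline(optimized, q)   # self-optimized prompt set
        report_exp = run_pipeline(expert, q)      # expert-crafted prompt set
        wins += judge_pair(q, report_opt, report_exp) >= 0  # >= 0: tie or win
    return wins / len(fresh_queries)
```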

Figures

Figures reproduced from arXiv:2604.02988 by Arthur Câmara, Jakub Zavrel, Vincent Slot.

Figure 1
Figure 1: Architecture for a multi-agent Deep Research system: (1) an orchestrator agent (orchestrator) creates a list of tasks for the user's question. Each task consists of a query and instructions. (2) multiple reader agents (reader) inspect batches of documents and extract the information requested in the task. (3) an aggregator agent (aggregator) combines these smaller information pieces into larger mini-reports for…
Figure 2
Figure 2: Example of exploration trees for both GEPA and TextGrad. Each node in the tree is a new candidate that was generated based on its parent. GEPA manages to explore different variants in a more diversified manner, while TextGrad does not explore that much.
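
As a rough illustration of what these trees contain, the sketch below models each node as a scored prompt candidate derived from its parent; the branching policy is where a diversified search (GEPA-like) and a mostly linear one (TextGrad-like) would differ. `Candidate`, `mutate`, and `score_fn` are illustrative stand-ins, not structures from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    prompt: str
    score: float
    children: list["Candidate"] = field(default_factory=list)

def expand(parent, mutate, score_fn, branching=3):
    """Derive `branching` child candidates from a parent node; a branching
    factor of 1 degenerates into the near-linear chain of the TextGrad tree."""
    for _ in range(branching):
        child_prompt = mutate(parent.prompt)     # LLM-proposed prompt variant
        parent.children.append(Candidate(child_prompt, score_fn(child_prompt)))
    return parent.children
```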
Original abstract

Given a user's complex information need, a multi-agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high-quality answer. In one possible architecture, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand-engineered prompts and static architectures, making improvement brittle, expensive, and time-consuming. We therefore explore various multi-agent optimization methods to show that enabling agents to self-play and explore different prompt combinations can produce high-quality Deep Research systems that match or outperform expert-crafted prompts.
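
The pipeline the abstract and Figure 1 describe can be read as the following control-flow sketch. It is a single-round simplification with assumed agent interfaces (`plan`, `extract`, `combine`, `write`); the actual system iterates, with the orchestrator deciding whether to run further rounds before the final report is written:

```python
def deep_research(question, orchestrator, reader, aggregator, writer, search, batch=10):
    """One planning round of the Figure 1 pipeline (assumed interfaces)."""
    tasks = orchestrator.plan(question)          # (1) tasks = query + instructions
    mini_reports = []
    for task in tasks:
        docs = search(task.query)                # retrieve candidate documents
        notes = [reader.extract(task, docs[i:i + batch])  # (2) readers scan batches
                 for i in range(0, len(docs), batch)]
        mini_reports.append(aggregator.combine(task, notes))  # (3) merge per task
    return writer.write(question, mini_reports)  # (4) draft the final report
```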

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes multi-agent Deep Research systems that iteratively plan, retrieve, and synthesize evidence from large document collections to answer complex user queries. It explores self-optimization methods in which agents engage in self-play to discover effective prompt combinations, claiming that the resulting systems can match or outperform those built with expert-crafted prompts and static architectures.

Significance. If the empirical claims hold under rigorous testing, the work could reduce the cost and brittleness of prompt engineering for multi-agent retrieval and synthesis pipelines, offering a path toward more adaptive Deep Research systems. The approach aligns with growing interest in automated agent design within information retrieval.

major comments (2)
  1. [Abstract] Abstract and evaluation sections: the central claim that self-play optimization yields transferable performance gains rests on unverified generalization. No held-out query sets, cross-collection tests, or overfitting controls are described, so reported improvements versus expert prompts could be artifacts of the specific optimization environment rather than robust advances.
  2. [Methods] Methods and results: concrete details on the self-play procedure, prompt search space, optimization algorithm, datasets, metrics (e.g., answer quality, retrieval precision), baselines, and ablation studies are absent. Without these, the performance claim cannot be evaluated or reproduced.
minor comments (2)
  1. [Architecture] Clarify the distinction between the orchestrator and worker agents and how self-play coordinates prompt updates across them.
  2. [Discussion] Add explicit discussion of computational cost and scalability of the self-play process relative to hand-engineering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment point by point below and will revise the manuscript accordingly to improve clarity, rigor, and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation sections: the central claim that self-play optimization yields transferable performance gains rests on unverified generalization. No held-out query sets, cross-collection tests, or overfitting controls are described, so reported improvements versus expert prompts could be artifacts of the specific optimization environment rather than robust advances.

    Authors: We agree that the current manuscript does not sufficiently demonstrate generalization. In the revised version we will add held-out query evaluations, cross-collection experiments on additional document sets, and explicit overfitting controls (e.g., monitoring performance on a validation split during self-play). These additions will allow us to test whether the observed gains transfer beyond the optimization environment. revision: yes

  2. Referee: [Methods] Methods and results: concrete details on the self-play procedure, prompt search space, optimization algorithm, datasets, metrics (e.g., answer quality, retrieval precision), baselines, and ablation studies are absent. Without these, the performance claim cannot be evaluated or reproduced.

    Authors: We acknowledge that the submitted draft omits necessary implementation details. The revised manuscript will contain an expanded Methods section that fully specifies the self-play procedure, the prompt search space, the optimization algorithm, the datasets, the evaluation metrics for answer quality and retrieval precision, all baselines, and the ablation studies performed. We will also include pseudocode to support reproducibility. revision: yes
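
One plausible shape for the overfitting control promised in response 1, with `step` (one self-play optimization round) and `evaluate_val` (quality on a held-out validation split) as assumed interfaces:

```python
def optimize_with_early_stop(prompts, step, evaluate_val, max_rounds=50, patience=5):
    """Halt self-play when held-out validation quality plateaus, returning the
    prompt set with the best validation score seen so far (overfitting control)."""
    best, best_val, stale = dict(prompts), evaluate_val(prompts), 0
    current = dict(prompts)
    for _ in range(max_rounds):
        current = step(current)          # one self-play optimization round
        val = evaluate_val(current)      # quality on the validation split
        if val > best_val:
            best, best_val, stale = dict(current), val, 0
        else:
            stale += 1                   # no improvement this round
            if stale >= patience:        # plateau detected: stop early
                break
    return best
```

Returning the best-validation prompt set rather than the last one is what makes the monitoring an actual control rather than just a log.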

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all details are deferred to the full manuscript.

pith-pipeline@v0.9.0 · 5389 in / 914 out tokens · 30894 ms · 2026-05-13T18:10:18.715268+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

  1. [1]

    Agrawal, L.A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M.J., Jiang, M., Potts, C., Sen, K., Dimakis, A.G., Stoica, I., Klein, D., Zaharia, M., Khattab, O.: Gepa: Reflective prompt evolution can outperform reinforcement learning (2025), https://arxiv.org/abs/2507.19457

  2. [2]

    Asai, A., He, J., Shao, R., Shi, W., Singh, A., Chang, J.C., Lo, K., Soldaini, L., Feldman, S., D’arcy, M., Wadden, D., Latzke, M., Tian, M., Ji, P., Liu, S., Tong, H., Wu, B., Xiong, Y., Zettlemoyer, L., Neubig, G., Weld, D., Downey, D., tau Yih, W., Koh, P.W., Hajishirzi, H.: Openscholar: Synthesizing scientific literature with retrieval-augmented lms (...

  3. [3]

    Coelho, J., Ning, J., He, J., Mao, K., Paladugu, A., Setlur, P., Jin, J., Callan, J., Magalhães, J., Martins, B., Xiong, C.: Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research (2025), https://arxiv.org/abs/2505.19253

  4. [4]

    Dharna, A., Lu, C., Clune, J.: Foundation model self-play: Open-ended strategy innovation via foundation models (2025), https://arxiv.org/abs/2507.06466

  5. [5]

    Gu, Z., Chen, X., Shi, X., Wang, T., Zheng, S., Li, T., Feng, H., Xiao, Y.: Gapo: Learning preferential prompt through generative adversarial policy optimization (2025), https://arxiv.org/abs/2503.20194

  6. [6]

    Hu, S., Lu, C., Clune, J.: Automated design of agentic systems (2025), https://arxiv.org/abs/2408.08435

  7. [7]

    Huang, Y., Chen, Y., Zhang, H., Li, K., Zhou, H., Fang, M., Yang, L., Li, X., Shang, L., Xu, S., Hao, J., Shao, K., Wang, J.: Deep research agents: A systematic examination and roadmap (2025), https://arxiv.org/abs/2506.18096

  8. [8]

    Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., Potts, C.: Dspy: Compiling declarative language model calls into self-improving pipelines. In: The Twelfth International Conference on Learning Representations (2024)

  9. [9]

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  10. [10]

    Rackauckas, Z., Câmara, A., Zavrel, J.: Evaluating rag-fusion with ragelo: an automated elo-based framework (2024), https://arxiv.org/abs/2406.14783

  11. [11]

    Rozanov, N., Rei, M.: StateAct: Enhancing LLM base agents via self-prompting and state-tracking. In: Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) (2025), https://aclanthology.org/2025.realm-1.27

  12. [12]

    Shao, R., Asai, A., Shen, S.Z., Ivison, H., Kishore, V., Zhuo, J., Zhao, X., Park, M., Finlayson, S.G., Sontag, D., Murray, T., Min, S., Dasigi, P., Soldaini, L., Brahman, F., tau Yih, W., Wu, T., Zettlemoyer, L., Kim, Y., Hajishirzi, H., Koh, P.W.: Dr tulu: Reinforcement learning with evolving rubrics for deep research (2025), https://arxiv.org/abs/2511.19399

  13. [13]

    Shao, Y., Jiang, Y., Kanell, T.A., Xu, P., Khattab, O., Lam, M.S.: Assisting in writing wikipedia-like articles from scratch with large language models (2024), https://arxiv.org/abs/2402.14207

  14. [14]

    Sharma, M., Zhang, C.B.C., Bandi, C., Wang, C., Aich, A., Nghiem, H., Rabbani, T., Htet, Y., Jang, B., Basu, S., Balwani, A., Peskoff, D., Ayestaran, M., Hendryx, S.M., Kenstler, B., Liu, B.: Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents (2025), https://arxiv.org/abs/2511.07685

  15. [15]

    Shi, Z., Chen, Y., Li, H., Sun, W., Ni, S., Lyu, Y., Fan, R.Z., Jin, B., Weng, Y., Zhu, M., Xie, Q., Guo, X., Yang, Q., Wu, J., Zhao, J., Tang, X., Ma, X., Wang, C., Mao, J., Ai, Q., Huang, J.T., Wang, W., Zhang, Y., Yang, Y., Tu, Z., Ren, Z.: Deep research: A systematic survey (2025), https://arxiv.org/abs/2512.02038

  16. [16]

    Spiess, C., Vaziri, M., Mandel, L., Hirzel, M.: Autopdl: Automatic prompt optimization for llm agents (2025), https://arxiv.org/abs/2504.04365

  17. [17]

    Wang, W., Alyahya, H.A., Ashley, D.R., Serikov, O., Khizbullin, D., Faccio, F., Schmidhuber, J.: How to correctly do semantic backpropagation on language-based agentic systems (2024), https://arxiv.org/abs/2412.03624

  18. [18]

    Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q.V., Zhou, D., Chen, X.: Large language models as optimizers (2024), https://arxiv.org/abs/2309.03409

  19. [19]

    Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., Zou, J.: Textgrad: Automatic "differentiation" via text (2024), https://arxiv.org/abs/2406.07496

  20. [20]

    Zhang, J., Hu, S., Lu, C., Lange, R., Clune, J.: Darwin godel machine: Open-ended evolution of self-improving agents (2025), https://arxiv.org/abs/2505.22954

  21. [21]

    Zhang, W., Tang, K., Wu, H., Wang, M., Shen, Y., Hou, G., Tan, Z., Li, P., Zhuang, Y., Lu, W.: Agent-pro: Learning to evolve via policy-level reflection and optimization. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2024), https://aclanthology.org/2024.acl-long.292

  22. [22]

    Zhou, H., Wan, X., Sun, R., Palangi, H., Iqbal, S., Vulić, I., Korhonen, A., Arık, S.Ö.: Multi-agent design: Optimizing agents with better prompts and topologies (2025), https://arxiv.org/abs/2502.02533
