pith. machine review for the scientific record.

arxiv: 2604.02988 · v1 · submitted 2026-04-03 · 💻 cs.IR · cs.AI

Recognition: no theorem link

Self-Optimizing Multi-Agent Systems for Deep Research

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:10 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords multi-agent systems · prompt optimization · self-play · deep research · information retrieval · automated agents · query synthesis

The pith

Multi-agent systems self-optimize prompts through self-play to match or exceed expert performance in deep research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how multi-agent Deep Research systems can improve by optimizing their own prompts instead of depending on manual engineering. Agents use self-play to experiment with various prompt combinations while iteratively planning, retrieving, and synthesizing information from many documents. This method yields systems that perform at least as well as those built with expert prompts, addressing the brittleness, high cost, and slow iteration of current hand-engineered architectures. Readers might value this because it points toward more flexible and efficient AI tools for handling complex queries.

Core claim

By enabling agents in a multi-agent architecture to self-play and explore different prompt combinations, the system can generate high-quality Deep Research outputs that match or outperform those from expert-crafted prompts, addressing the limitations of static, hand-engineered designs.

What carries the argument

Self-play optimization of prompt combinations, where an orchestrator agent coordinates worker agents that test and refine prompts autonomously.
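
As a minimal sketch of what such a loop could look like, assume a greedy accept-if-better search over per-agent prompts. The optimizers the paper actually compares (GEPA and TextGrad, per Figure 2) explore more richly, and `mutate` and `evaluate` are hypothetical interfaces rather than the authors' API:

```python
import random

def optimize_prompts(prompts, mutate, evaluate, rounds=20, seed=0):
    """Greedy self-play loop: vary one agent's prompt per round, keep improvements.

    prompts:  dict mapping agent name (orchestrator, reader, ...) -> prompt string
    mutate:   callable(prompt) -> rewritten prompt (e.g. an LLM proposing a variant)
    evaluate: callable(prompts) -> mean judged report quality on training queries
    """
    rng = random.Random(seed)
    best, best_score = dict(prompts), evaluate(prompts)
    for _ in range(rounds):
        candidate = dict(best)
        agent = rng.choice(sorted(candidate))        # which agent's prompt to vary
        candidate[agent] = mutate(candidate[agent])  # propose a prompt variant
        score = evaluate(candidate)
        if score > best_score:                       # accept only improvements
            best, best_score = candidate, score
    return best
```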

If this is right

  • Reduces the need for time-consuming hand-engineering of prompts by experts.
  • Creates systems that can adapt more readily to new, complex information needs.
  • Lowers the overall cost and effort required to build effective deep research tools.
  • May lead to more robust performance across diverse document collections and queries.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar self-optimization techniques could apply to other agent-based tasks like multi-step planning or collaborative problem-solving.
  • Over time, such systems might develop emergent behaviors not anticipated in initial designs.
  • Combining this with larger language models could further enhance synthesis capabilities in research scenarios.

Load-bearing premise

That the performance improvements observed from self-play on specific tested tasks will hold for entirely new queries and document sets without the system overfitting to its training environment.

What would settle it

Running the self-optimized system on a fresh set of complex user queries over new document collections and comparing its outputs against those from expert-designed prompts: if it consistently fails to match or exceed their quality, the core claim falls.
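
A minimal harness for that test might look like the sketch below, where `run_pipeline` and `judge_pair` are hypothetical interfaces (the judge returning a signed preference, positive when the first report is better); nothing here is specified by the paper:

```python
def win_rate(optimized, expert, fresh_queries, run_pipeline, judge_pair):
    """Fraction of held-out queries where the self-optimized prompts produce a
    report judged at least as good as the expert-prompted system's report."""
    wins = 0
    for q in fresh_queries:
        report_opt = run_pipeline(optimized, q)   # self-optimized prompt set
        report_exp = run_pipeline(expert, q)      # expert-crafted prompt set
        wins += judge_pair(q, report_opt, report_exp) >= 0  # >= 0: tie or win
    return wins / len(fresh_queries)
```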

Figures

Figures reproduced from arXiv:2604.02988 by Arthur Câmara, Jakub Zavrel, Vincent Slot.

Figure 1
Figure 1: Architecture for a multi-agent Deep Research system: (1) an orchestrator agent (orchestrator) creates a list of tasks for the user's question. Each task consists of a query and instructions. (2) multiple reader agents (reader) inspect batches of documents and extract the information requested in the task. (3) an aggregator agent (aggregator) combines these smaller information pieces into larger mini-reports for…
Figure 2
Figure 2: Example of exploration trees for both GEPA and TextGrad. Each node in the tree is a new candidate that was generated based on its parent. GEPA manages to explore different variants in a more diversified manner, while TextGrad does not explore that much.
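
As a rough illustration of what these trees contain, the sketch below models each node as a scored prompt candidate derived from its parent; the branching policy is where a diversified search (GEPA-like) and a mostly linear one (TextGrad-like) would differ. `Candidate`, `mutate`, and `score_fn` are illustrative stand-ins, not structures from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    prompt: str
    score: float
    children: list["Candidate"] = field(default_factory=list)

def expand(parent, mutate, score_fn, branching=3):
    """Derive `branching` child candidates from a parent node; a branching
    factor of 1 degenerates into the near-linear chain of the TextGrad tree."""
    for _ in range(branching):
        child_prompt = mutate(parent.prompt)     # LLM-proposed prompt variant
        parent.children.append(Candidate(child_prompt, score_fn(child_prompt)))
    return parent.children
```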
Original abstract

Given a user's complex information need, a multi-agent Deep Research system iteratively plans, retrieves, and synthesizes evidence across hundreds of documents to produce a high-quality answer. In one possible architecture, an orchestrator agent coordinates the process, while parallel worker agents execute tasks. Current Deep Research systems, however, often rely on hand-engineered prompts and static architectures, making improvement brittle, expensive, and time-consuming. We therefore explore various multi-agent optimization methods to show that enabling agents to self-play and explore different prompt combinations can produce high-quality Deep Research systems that match or outperform expert-crafted prompts.
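
The pipeline the abstract and Figure 1 describe can be read as the following control-flow sketch. It is a single-round simplification with assumed agent interfaces (`plan`, `extract`, `combine`, `write`); the actual system iterates, with the orchestrator deciding whether to run further rounds before the final report is written:

```python
def deep_research(question, orchestrator, reader, aggregator, writer, search, batch=10):
    """One planning round of the Figure 1 pipeline (assumed interfaces)."""
    tasks = orchestrator.plan(question)          # (1) tasks = query + instructions
    mini_reports = []
    for task in tasks:
        docs = search(task.query)                # retrieve candidate documents
        notes = [reader.extract(task, docs[i:i + batch])  # (2) readers scan batches
                 for i in range(0, len(docs), batch)]
        mini_reports.append(aggregator.combine(task, notes))  # (3) merge per task
    return writer.write(question, mini_reports)  # (4) draft the final report
```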

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes multi-agent Deep Research systems that iteratively plan, retrieve, and synthesize evidence from large document collections to answer complex user queries. It explores self-optimization methods in which agents engage in self-play to discover effective prompt combinations, claiming that the resulting systems can match or outperform those built with expert-crafted prompts and static architectures.

Significance. If the empirical claims hold under rigorous testing, the work could reduce the cost and brittleness of prompt engineering for multi-agent retrieval and synthesis pipelines, offering a path toward more adaptive Deep Research systems. The approach aligns with growing interest in automated agent design within information retrieval.

major comments (2)
  1. [Abstract] Abstract and evaluation sections: the central claim that self-play optimization yields transferable performance gains rests on unverified generalization. No held-out query sets, cross-collection tests, or overfitting controls are described, so reported improvements versus expert prompts could be artifacts of the specific optimization environment rather than robust advances.
  2. [Methods] Methods and results: concrete details on the self-play procedure, prompt search space, optimization algorithm, datasets, metrics (e.g., answer quality, retrieval precision), baselines, and ablation studies are absent. Without these, the performance claim cannot be evaluated or reproduced.
minor comments (2)
  1. [Architecture] Clarify the distinction between the orchestrator and worker agents and how self-play coordinates prompt updates across them.
  2. [Discussion] Add explicit discussion of computational cost and scalability of the self-play process relative to hand-engineering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment point by point below and will revise the manuscript accordingly to improve clarity, rigor, and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation sections: the central claim that self-play optimization yields transferable performance gains rests on unverified generalization. No held-out query sets, cross-collection tests, or overfitting controls are described, so reported improvements versus expert prompts could be artifacts of the specific optimization environment rather than robust advances.

    Authors: We agree that the current manuscript does not sufficiently demonstrate generalization. In the revised version we will add held-out query evaluations, cross-collection experiments on additional document sets, and explicit overfitting controls (e.g., monitoring performance on a validation split during self-play). These additions will allow us to test whether the observed gains transfer beyond the optimization environment. revision: yes

  2. Referee: [Methods] Methods and results: concrete details on the self-play procedure, prompt search space, optimization algorithm, datasets, metrics (e.g., answer quality, retrieval precision), baselines, and ablation studies are absent. Without these, the performance claim cannot be evaluated or reproduced.

    Authors: We acknowledge that the submitted draft omits necessary implementation details. The revised manuscript will contain an expanded Methods section that fully specifies the self-play procedure, the prompt search space, the optimization algorithm, the datasets, the evaluation metrics for answer quality and retrieval precision, all baselines, and the ablation studies performed. We will also include pseudocode to support reproducibility. revision: yes
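
One plausible shape for the overfitting control promised in response 1, with `step` (one self-play optimization round) and `evaluate_val` (quality on a held-out validation split) as assumed interfaces:

```python
def optimize_with_early_stop(prompts, step, evaluate_val, max_rounds=50, patience=5):
    """Halt self-play when held-out validation quality plateaus, returning the
    prompt set with the best validation score seen so far (overfitting control)."""
    best, best_val, stale = dict(prompts), evaluate_val(prompts), 0
    current = dict(prompts)
    for _ in range(max_rounds):
        current = step(current)          # one self-play optimization round
        val = evaluate_val(current)      # quality on the validation split
        if val > best_val:
            best, best_val, stale = dict(current), val, 0
        else:
            stale += 1                   # no improvement this round
            if stale >= patience:        # plateau detected: stop early
                break
    return best
```

Returning the best-validation prompt set rather than the last one is what makes the monitoring an actual control rather than just a log.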

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all details are deferred to the full manuscript.

pith-pipeline@v0.9.0 · 5389 in / 914 out tokens · 30894 ms · 2026-05-13T18:10:18.715268+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 4 internal anchors

  1. [1]

    Agrawal, L.A., Tan, S., Soylu, D., Ziems, N., Khare, R., Opsahl-Ong, K., Singhvi, A., Shandilya, H., Ryan, M.J., Jiang, M., Potts, C., Sen, K., Dimakis, A.G., Stoica, I., Klein, D., Zaharia, M., Khattab, O.: Gepa: Reflective prompt evolution can outperform reinforcement learning (2025), https://arxiv.org/abs/2507.19457

  2. [2]

    Asai, A., He, J., Shao, R., Shi, W., Singh, A., Chang, J.C., Lo, K., Soldaini, L., Feldman, S., D’arcy, M., Wadden, D., Latzke, M., Tian, M., Ji, P., Liu, S., Tong, H., Wu, B., Xiong, Y., Zettlemoyer, L., Neubig, G., Weld, D., Downey, D., tau Yih, W., Koh, P.W., Hajishirzi, H.: Openscholar: Synthesizing scientific literature with retrieval-augmented lms (...

  3. [3]

    Coelho, J., Ning, J., He, J., Mao, K., Paladugu, A., Setlur, P., Jin, J., Callan, J., Magalhães, J., Martins, B., Xiong, C.: Deepresearchgym: A free, transparent, and reproducible evaluation sandbox for deep research (2025), https://arxiv.org/abs/2505.19253

  4. [4]

    Dharna, A., Lu, C., Clune, J.: Foundation model self-play: Open-ended strategy innovation via foundation models (2025), https://arxiv.org/abs/2507.06466

  5. [5]

    Gu, Z., Chen, X., Shi, X., Wang, T., Zheng, S., Li, T., Feng, H., Xiao, Y.: Gapo: Learning preferential prompt through generative adversarial policy optimization (2025), https://arxiv.org/abs/2503.20194

  6. [6]

    Hu, S., Lu, C., Clune, J.: Automated design of agentic systems (2025), https://arxiv.org/abs/2408.08435

  7. [7]

    Huang, Y., Chen, Y., Zhang, H., Li, K., Zhou, H., Fang, M., Yang, L., Li, X., Shang, L., Xu, S., Hao, J., Shao, K., Wang, J.: Deep research agents: A systematic examination and roadmap (2025), https://arxiv.org/abs/2506.18096

  8. [8]

    Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T.T., Moazam, H., Miller, H., Zaharia, M., Potts, C.: Dspy: Compiling declarative language model calls into self-improving pipelines. In: The Twelfth International Conference on Learning Representations (2024)

  9. [9]

    Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Advances in Neural Information Processing Systems (NeurIPS) (2020)

  10. [10]

    Rackauckas, Z., Câmara, A., Zavrel, J.: Evaluating rag-fusion with ragelo: an automated elo-based framework (2024), https://arxiv.org/abs/2406.14783

  11. [11]

    Rozanov, N., Rei, M.: StateAct: Enhancing LLM base agents via self-prompting and state-tracking. In: Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025) (2025), https://aclanthology.org/2025.realm-1.27

  12. [12]

    Shao, R., Asai, A., Shen, S.Z., Ivison, H., Kishore, V., Zhuo, J., Zhao, X., Park, M., Finlayson, S.G., Sontag, D., Murray, T., Min, S., Dasigi, P., Soldaini, L., Brahman, F., tau Yih, W., Wu, T., Zettlemoyer, L., Kim, Y., Hajishirzi, H., Koh, P.W.: Dr tulu: Reinforcement learning with evolving rubrics for deep research (2025), https://arxiv.org/abs/2511.19399

  13. [13]

    Shao, Y., Jiang, Y., Kanell, T.A., Xu, P., Khattab, O., Lam, M.S.: Assisting in writing wikipedia-like articles from scratch with large language models (2024), https://arxiv.org/abs/2402.14207

  14. [14]

    Sharma, M., Zhang, C.B.C., Bandi, C., Wang, C., Aich, A., Nghiem, H., Rabbani, T., Htet, Y., Jang, B., Basu, S., Balwani, A., Peskoff, D., Ayestaran, M., Hendryx, S.M., Kenstler, B., Liu, B.: Researchrubrics: A benchmark of prompts and rubrics for evaluating deep research agents (2025), https://arxiv.org/abs/2511.07685

  15. [15]

    Shi, Z., Chen, Y., Li, H., Sun, W., Ni, S., Lyu, Y., Fan, R.Z., Jin, B., Weng, Y., Zhu, M., Xie, Q., Guo, X., Yang, Q., Wu, J., Zhao, J., Tang, X., Ma, X., Wang, C., Mao, J., Ai, Q., Huang, J.T., Wang, W., Zhang, Y., Yang, Y., Tu, Z., Ren, Z.: Deep research: A systematic survey (2025), https://arxiv.org/abs/2512.02038

  16. [16]

    Spiess, C., Vaziri, M., Mandel, L., Hirzel, M.: Autopdl: Automatic prompt optimization for llm agents (2025), https://arxiv.org/abs/2504.04365

  17. [17]

    Wang, W., Alyahya, H.A., Ashley, D.R., Serikov, O., Khizbullin, D., Faccio, F., Schmidhuber, J.: How to correctly do semantic backpropagation on language-based agentic systems (2024), https://arxiv.org/abs/2412.03624

  18. [18]

    Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q.V., Zhou, D., Chen, X.: Large language models as optimizers (2024), https://arxiv.org/abs/2309.03409

  19. [19]

    Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., Zou, J.: Textgrad: Automatic "differentiation" via text (2024), https://arxiv.org/abs/2406.07496

  20. [20]

    Zhang, J., Hu, S., Lu, C., Lange, R., Clune, J.: Darwin godel machine: Open-ended evolution of self-improving agents (2025), https://arxiv.org/abs/2505.22954

  21. [21]

    Zhang, W., Tang, K., Wu, H., Wang, M., Shen, Y., Hou, G., Tan, Z., Li, P., Zhuang, Y., Lu, W.: Agent-pro: Learning to evolve via policy-level reflection and optimization. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL) (2024), https://aclanthology.org/2024.acl-long.292

  22. [22]

    Zhou, H., Wan, X., Sun, R., Palangi, H., Iqbal, S., Vulić, I., Korhonen, A., Arık, S.Ö.: Multi-agent design: Optimizing agents with better prompts and topologies (2025), https://arxiv.org/abs/2502.02533
