arxiv: 2510.18821 · v3 · pith:KP27Q3ABnew · submitted 2025-10-21 · 💻 cs.LG

Search Self-play: Pushing the Frontier of Agent Capability without Supervision

Pith reviewed 2026-05-21 19:54 UTC · model grok-4.3

classification 💻 cs.LG
keywords search self-play · RLVR · LLM agents · self-play training · task synthesis · unsupervised RL · RAG verification · agent capability

The pith

Search self-play lets an LLM generate its own harder tasks and verify their answers to train better search agents without human supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes search self-play where one LLM simultaneously proposes deep search queries with increasing difficulty and solves them as a problem solver. The proposer and solver co-evolve through competition and cooperation in a multi-turn search setting. To create reliable ground-truth labels without human effort, the method collects the proposer's search results as external knowledge and applies retrieval-augmented generation to check whether the query can be answered correctly from those documents. This produces supervision signals for reinforcement learning that scale without task synthesis bottlenecks. Experiments demonstrate uniform performance gains on multiple benchmarks under both from-scratch and continued RL training.

Core claim

In search self-play, the learning LLM uses multi-turn search engine calls and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to answer the generated search queries correctly. To ensure that each generated search query has an accurate ground truth, the system collects all the search results from the proposer's trajectory as external knowledge, then runs retrieval-augmented generation (RAG) to test whether the proposed query can be answered correctly when all necessary search documents are provided.
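
As a rough illustration of the loop described above, the following sketch runs one self-play round with stand-in functions. Everything here is hypothetical: `Proposal`, `rag_verify`, `ssp_round`, the toy reader and solver, the exact-match comparison, and in particular the reward rule (solver rewarded for a correct answer, proposer rewarded when a verified task defeats the solver). The page does not specify the paper's actual reward definitions, so this is only a minimal sketch under those assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Proposal:
    """A proposed task plus the search results gathered while proposing it."""
    question: str
    answer: str
    documents: List[str] = field(default_factory=list)


def same_answer(a: str, b: str) -> bool:
    # Assumption: case-insensitive exact match stands in for answer judging.
    return a.strip().lower() == b.strip().lower()


def rag_verify(proposal: Proposal,
               reader: Callable[[str, List[str]], str]) -> bool:
    """Accept a task only if it is answerable from the proposer's own documents."""
    return same_answer(reader(proposal.question, proposal.documents),
                       proposal.answer)


def ssp_round(proposal: Proposal,
              solver: Callable[[str], str],
              reader: Callable[[str, List[str]], str]) -> Dict[str, float]:
    """One round: verify the task, let the solver try it, assign rewards."""
    if not rag_verify(proposal, reader):
        # Unverifiable tasks give no training signal in this sketch.
        return {"valid": 0.0, "solver_reward": 0.0, "proposer_reward": 0.0}
    solved = same_answer(solver(proposal.question), proposal.answer)
    solver_reward = 1.0 if solved else 0.0
    # Assumed competitive rule: the proposer gains when the solver fails.
    return {"valid": 1.0,
            "solver_reward": solver_reward,
            "proposer_reward": 1.0 - solver_reward}


def toy_reader(question: str, documents: List[str]) -> str:
    # Stand-in for an LLM that answers only from the supplied documents.
    for doc in documents:
        if "capital of France" in doc and "Paris" in doc:
            return "Paris"
    return "unknown"


def toy_solver(question: str) -> str:
    # Stand-in for the multi-turn search agent.
    return "Paris"


grounded = Proposal("What is the capital of France?", "Paris",
                    ["Paris is the capital of France."])
ungrounded = Proposal("What is the capital of France?", "Paris", [])

print(ssp_round(grounded, toy_solver, toy_reader))
print(ssp_round(ungrounded, toy_solver, toy_reader))
```

The second call shows the filtering role of verification: a question whose answer is not supported by the proposer's collected documents is rejected before it can reward either player.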

What carries the argument

The search self-play game in which a single LLM alternates between proposing increasingly difficult search queries and solving them, with RAG verification on the proposer's own search trajectory supplying the ground-truth labels for RL updates.

If this is right

  • RL training for search agents scales without the human labor of crafting task queries and ground-truth answers.
  • Agent performance improves uniformly across diverse benchmarks in both from-scratch and continuous training regimes.
  • Task difficulty increases automatically through the proposer's generation process during co-evolution.
  • The proposer and solver improve each other through the combination of competition and shared search results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could transfer to other agent domains such as code execution or web navigation where verifiable rewards are scarce.
  • Success depends heavily on the quality and coverage of the underlying search engine used for both proposal and verification.
  • Long-term training may require additional mechanisms to maintain task diversity and prevent the proposer from converging on narrow query patterns.

Load-bearing premise

The RAG verification step that collects search results from the proposer's trajectory and tests answerability actually produces reliable, unbiased ground-truth labels suitable for RL training.

What would settle it

A manual audit of generated tasks finds that a substantial fraction of RAG-verified answers are factually wrong, or multiple independent RL runs with SSP tasks show no gain over standard RLVR baselines on held-out search benchmarks.
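
One way to make "a substantial fraction" concrete is to put an interval around the audited error rate. The sketch below computes a Wilson score interval for the share of RAG-verified answers that an auditor marks wrong. The function name `wilson_interval` and the audit counts in the usage line are hypothetical and are not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(wrong: int, audited: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the fraction of audited labels found wrong.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if audited <= 0:
        raise ValueError("audited must be positive")
    if not 0 <= wrong <= audited:
        raise ValueError("wrong must lie between 0 and audited")
    p = wrong / audited
    denom = 1 + z * z / audited
    center = (p + z * z / (2 * audited)) / denom
    half = z * math.sqrt(p * (1 - p) / audited
                         + z * z / (4 * audited * audited)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: 12 wrong labels among 200 sampled tasks.
low, high = wilson_interval(12, 200)
print(f"label error rate plausibly between {low:.3f} and {high:.3f}")
```

An audit whose upper bound stays small would support the verification premise; one whose lower bound is already large would undermine it.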

Original abstract

Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires significant human effort and hinders the scaling of RL processes, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer's trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents' performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at https://github.com/Qwen-Applications/SSP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Search Self-play (SSP), a self-play framework in which an LLM simultaneously acts as a task proposer generating deep search queries and as a solver attempting to answer them. To enable RLVR without external supervision or human-crafted ground truth, the method collects search results exclusively from the proposer's trajectory, applies RAG to verify whether the query admits a correct answer from those documents, and uses the resulting signal to co-evolve proposer and solver capabilities. Experiments report uniform performance gains on multiple benchmarks under both from-scratch and continuous RL training regimes.

Significance. If the RAG-based verification produces reliable, unbiased rewards, SSP would constitute a scalable route to unsupervised RL for agentic search tasks, removing a major bottleneck in task curation and potentially extending the reach of verifiable-reward training to more open-ended domains.

major comments (2)
  1. [§3] §3 (Search Self-play Mechanism), paragraph describing the RAG verification step: the procedure collects all searching results from the proposer's own trajectory and then applies RAG to decide whether the proposed query 'can be correctly answered' with those documents. This creates a potential circularity in which the proposer can surface documents that render its query verifiable by construction; the resulting label may therefore reflect retrieval artifacts or model-specific knowledge rather than an independent ground truth. The central claim of 'without any supervision' and 'genuine capability gains' rests on the assumption that these labels are unbiased; the manuscript provides no ablation that isolates the verification component or compares it against an external oracle.
  2. [§4] §4 (Experimental Setup and Results), the from-scratch and continuous RL tables: while uniform improvements are reported, the manuscript does not present controls for difficulty ramp hyperparameters (listed as free parameters in the axiom ledger) or post-hoc task filtering. Without these, it is difficult to rule out that gains arise from implicit curriculum effects or selective retention of easier queries rather than from the self-play loop itself.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 refer to 'various benchmarks' without enumerating them; adding an explicit list (or a table reference) would improve clarity.
  2. [§3] Notation for the proposer and solver roles is introduced informally; a single diagram or pseudocode block in §3 would help readers track the two-player interaction across turns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. The feedback highlights important considerations for the RAG verification process and experimental controls in Search Self-play (SSP). We address each major comment below, providing clarifications and committing to targeted revisions that will strengthen the presentation of our results without altering the core claims.

  1. Referee: §3 (Search Self-play Mechanism), paragraph describing the RAG verification step: the procedure collects all searching results from the proposer's own trajectory and then applies RAG to decide whether the proposed query 'can be correctly answered' with those documents. This creates a potential circularity in which the proposer can surface documents that render its query verifiable by construction; the resulting label may therefore reflect retrieval artifacts or model-specific knowledge rather than an independent ground truth. The central claim of 'without any supervision' and 'genuine capability gains' rests on the assumption that these labels are unbiased; the manuscript provides no ablation that isolates the verification component or compares it against an external oracle.

    Authors: We appreciate the referee's identification of this potential issue in the verification design. The RAG step uses documents retrieved via the proposer's search trajectory solely to confirm that the generated query admits at least one correct answer extractable from those documents, thereby supplying a verifiable reward without any external human-provided ground truth. This preserves the unsupervised character of the training loop. While we acknowledge that retrieval choices could introduce model- or search-specific biases, the verification is performed by a separate RAG generation pass that does not depend on the proposer's final answer prediction. To directly mitigate the concern, we will add an ablation in the revised manuscript that (i) compares RAG-verified labels against an independent external oracle on a subset of queries and (ii) reports performance when verification is replaced by a fixed external knowledge base. We will also expand §3 with a precise diagram and pseudocode clarifying the temporal order of query proposal, search, and verification. revision: yes

  2. Referee: §4 (Experimental Setup and Results), the from-scratch and continuous RL tables: while uniform improvements are reported, the manuscript does not present controls for difficulty ramp hyperparameters (listed as free parameters in the axiom ledger) or post-hoc task filtering. Without these, it is difficult to rule out that gains arise from implicit curriculum effects or selective retention of easier queries rather than from the self-play loop itself.

    Authors: We agree that additional controls would help isolate the contribution of the self-play dynamics. In the current implementation, the difficulty ramp is a configurable hyperparameter that the proposer uses to modulate query complexity over time, but no post-hoc filtering is applied: every query that passes RAG verification is retained for solver training. The reported uniform gains across diverse benchmarks are therefore produced under this regime. To address the referee's point, we will include new control experiments in the revision in which the difficulty ramp parameters are fixed to constant values and results are reported both with and without any retention logic, thereby demonstrating that performance improvements persist due to the proposer-solver co-evolution rather than curriculum or selection artifacts alone. These controls will be added to §4 and the corresponding tables. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmarks

The paper describes an empirical self-play training loop for search agents. The proposer generates queries, the solver attempts them, and RAG verification on the proposer's own trajectory documents supplies the reward signal for RL. Success is then measured by performance improvements on separate external benchmarks under from-scratch and continuous RL setups. No equations, fitted parameters, or uniqueness theorems are presented that reduce the reported gains to quantities defined inside the loop itself. The method is self-contained against external evaluation and contains no self-citation load-bearing steps or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that RAG verification produces trustworthy rewards and that the proposer-solver co-evolution produces genuine capability gains rather than artifacts of the verification procedure. No new physical entities are postulated; the method operates within standard LLM and RL frameworks.

free parameters (1)
  • difficulty ramp hyperparameters
    Parameters controlling how rapidly the proposer increases task difficulty are chosen to produce effective training signals.
axioms (1)
  • domain assumption: RAG verification on the proposer's collected search documents reliably determines whether a generated query possesses a well-defined, answerable ground truth.
    This premise is invoked to guarantee that rewards supplied to the solver are accurate.
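
To show what a difficulty ramp hyperparameter might look like, here is a minimal sketch that assumes difficulty is expressed as the number of searches a proposed question should require, increased linearly over training. The function `search_count_at` and the parameters `n_min` and `n_max` are hypothetical illustrations; the paper's actual schedule is not described on this page.

```python
def search_count_at(step: int, total_steps: int,
                    n_min: int = 1, n_max: int = 5) -> int:
    """Required search count for the proposer at a given training step.

    Linear ramp from n_min to n_max. Setting n_min == n_max gives a
    constant-difficulty control with no ramp.
    """
    if n_min > n_max:
        raise ValueError("n_min must not exceed n_max")
    if total_steps <= 1:
        return n_min
    # Clamp progress to [0, 1] so out-of-range steps stay within bounds.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return n_min + round(progress * (n_max - n_min))


# Ramped schedule versus a fixed-difficulty control, sampled at a few steps.
for step in (0, 250, 500, 750, 999):
    ramped = search_count_at(step, 1000)
    fixed = search_count_at(step, 1000, n_min=3, n_max=3)
    print(step, ramped, fixed)
```

Comparing runs under the ramped and fixed settings is one way to separate curriculum effects from the self-play dynamics, in the spirit of the controls discussed in the referee report.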

pith-pipeline@v0.9.0 · 5833 in / 1306 out tokens · 130451 ms · 2026-05-21T19:54:29.382128+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EXG: Self-Evolving Agents with Experience Graphs

    cs.AI 2026-05 unverdicted novelty 7.0

    EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.

  2. π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  3. Meta-Tool: Efficient Few-Shot Tool Adaptation for Small Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    A 3B model with few-shot prompting reaches 79.7% of GPT-5 tool-use performance while a hypernetwork adaptation adds zero measurable benefit across four benchmarks.
