OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories
Pith reviewed 2026-05-06 05:14 UTC · model claude-opus-4-7
The pith
A 30B search agent trained with supervised fine-tuning alone on 10.6k carefully filtered trajectories matches or beats agents built with the full continual-pretraining plus reinforcement-learning pipeline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors argue that the heavy industrial recipe for training deep-search agents — continual pre-training, supervised fine-tuning, and reinforcement learning stacked together — is not actually required to reach state-of-the-art at the 30B scale. Three targeted changes to the synthetic training data are enough: enlarge the knowledge-graph neighborhood used to seed each question so that answers genuinely require multi-hop evidence; broaden the tool set so agents learn diverse interaction patterns; and discard any trajectory that solves in too few tool calls, enforcing a minimum-difficulty floor. With only 10.6k such trajectories and pure SFT on a 30B Qwen3 thinking model, the resulting agent
What carries the argument
A synthetic trajectory pipeline with three knobs: (1) a graph-expansion budget K that grows the evidence subgraph around each seed node so generated questions structurally demand multi-hop aggregation; (2) an enlarged action set A from which ReAct trajectories are drawn; (3) a hard threshold T_min on tool-call count that discards shallow trajectories before SFT. The agent itself is a ReAct loop over Qwen3-30B-A3B-Thinking-2507 with a 256k context and up to 200 tool calls.
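The third knob, the low-step filter, is simple enough to sketch in a few lines. This is a minimal illustration rather than the authors' implementation: the `Trajectory` class, the `low_step_filter` function, and the threshold value 3 are all hypothetical, and each step is assumed to contain exactly one tool call.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One synthesized (question, ReAct trace) pair. Hypothetical structure."""
    question: str
    # Each step is (reasoning, action, observation), mirroring (r1, a1, o1, ...).
    steps: list[tuple[str, str, str]] = field(default_factory=list)

    def tool_calls(self) -> int:
        # Assumption: every step contains exactly one tool call.
        return len(self.steps)


def low_step_filter(raw: list[Trajectory], t_min: int) -> list[Trajectory]:
    """Keep only trajectories with at least t_min tool calls."""
    return [t for t in raw if t.tool_calls() >= t_min]


# Toy data; the threshold 3 is illustrative only, not a value from the report.
raw = [
    Trajectory("easy", [("think", "search", "result")]),
    Trajectory("hard", [("think", "search", "result")] * 5),
]
kept = low_step_filter(raw, t_min=3)
print([t.question for t in kept])
```

The filter trades dataset size for difficulty: anything below the floor is dropped outright rather than down-weighted.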
If this is right
- State-of-the-art deep-search behavior at 30B does not require reinforcement learning, only sufficiently long and difficult supervised trajectories.
- Trajectory length distribution (average tool-call count) is a usable proxy for training-data difficulty when curating search-agent corpora.
- Filtering out short-trajectory examples — even at the cost of dataset size — improves long-horizon search performance more than adding easy data.
- Academic-scale teams can produce competitive frontier search agents if they invest in graph-driven question synthesis rather than RL infrastructure.
- Knowledge-graph expansion budget and tool-set size are tunable levers for question difficulty in synthetic agent data.
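How an expansion budget bounds the evidence subgraph can be illustrated with a breadth-first expansion. This is a sketch under stated assumptions: the paper's actual expansion operator is not described here, so the BFS order, the reading of K as a node budget, and the toy graph are all hypothetical.

```python
from collections import deque


def expand_subgraph(graph: dict[str, list[str]], seed: str, k: int) -> list[str]:
    """Breadth-first expansion from `seed`, stopping once k nodes are collected.

    Assumption: K is read as a node budget; the paper may define it differently.
    """
    if k <= 0:
        return []
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < k:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) == k:
                    break
    return visited


# Toy knowledge graph with hypothetical entities.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}
small = expand_subgraph(toy_graph, "A", k=2)
large = expand_subgraph(toy_graph, "A", k=5)
print(small, large)
```

With the larger budget the subgraph reaches nodes two hops from the seed, which is the structural property the multi-hop questions are meant to rely on.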
Where Pith is reading between the lines
- The 64.7-step average trajectory length suggests the model is being taught a particular search rhythm; benchmarks that reward shorter solutions (or penalize tool-call count) would likely show a smaller margin or reversal.
- Because only the v1→v2 endpoints are compared, the relative contributions of graph scaling, tool-set expansion, and low-step filtering are entangled; the low-step filter alone could plausibly account for most of the gain, since it is the only modification that directly raises the difficulty floor of every retained example.
- If the underlying base model (Qwen3-30B-A3B-Thinking-2507) already encodes much of the reasoning capacity, the result may generalize less well to base models without a strong native thinking mode, making the recipe partly a story about pairing difficult SFT data with thinking-tuned bases.
- The approach implies a natural curriculum extension: progressively raise T_min during training rather than applying a single floor, which could push performance further without growing dataset size.
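The curriculum idea in the last point could look something like the following. This is purely a speculative sketch: the linear schedule, the function names, and all numeric values are assumptions, and nothing of the kind is described in the paper.

```python
def t_min_schedule(step: int, total_steps: int, t_start: int, t_end: int) -> int:
    """Linearly raise the tool-call floor from t_start to t_end over training."""
    if total_steps <= 1:
        return t_end
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return round(t_start + frac * (t_end - t_start))


def eligible(tool_call_counts: list[int], step: int, total_steps: int,
             t_start: int, t_end: int) -> list[int]:
    """Return indices of trajectories whose tool-call count meets the current floor."""
    floor = t_min_schedule(step, total_steps, t_start, t_end)
    return [i for i, n in enumerate(tool_call_counts) if n >= floor]


# Hypothetical tool-call counts for four trajectories.
counts = [2, 8, 15, 40]
early = eligible(counts, step=0, total_steps=101, t_start=0, t_end=20)
late = eligible(counts, step=100, total_steps=101, t_start=0, t_end=20)
print(early, late)
```

Early in training every trajectory is eligible; by the final step only the longest one remains, so the dataset itself does not need to grow.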
Load-bearing premise
That the benchmark gains reflect the agent learning to search and reason in general, rather than the synthetic training questions being drawn from the same kinds of web sources and structures the benchmarks themselves probe.
What would settle it
Run a contamination and distribution audit: rebuild the 10.6k trajectories from a knowledge graph seeded only from sources demonstrably disjoint from BrowseComp, BrowseComp-ZH, HLE, and xbench evidence, and retrain. If the four-benchmark scores fall back to the v1 range (around 30 on BrowseComp, 48 on BC-ZH, 74 on xbench), the gains were dataset-similarity rather than data-difficulty. A second falsifier: ablate each of the three modifications separately; if no single one moves the needle, the v1→v2 jump must be attributed to something else.
Original abstract
Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
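The margins over Tongyi DeepResearch follow directly from the scores quoted in the abstract. The short calculation below uses only those eight numbers; the variable names are illustrative.

```python
# Scores quoted in the abstract (percent).
openseeker_v2 = {"BrowseComp": 46.0, "BrowseComp-ZH": 58.1, "HLE": 34.6, "xbench": 78.0}
tongyi_dr = {"BrowseComp": 43.4, "BrowseComp-ZH": 46.7, "HLE": 32.9, "xbench": 75.0}

# Per-benchmark gap, rounded to one decimal to match the reported precision.
margins = {name: round(openseeker_v2[name] - tongyi_dr[name], 1) for name in openseeker_v2}
for name, gap in margins.items():
    print(f"{name}: +{gap}")
```

On these figures the gaps are 2.6, 11.4, 1.7, and 3.0 points respectively, so the BrowseComp-ZH lead is by far the largest of the four.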
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The report introduces OpenSeeker-v2, a 30B-parameter ReAct search agent obtained by supervised fine-tuning of Qwen3-30B-A3B-Thinking-2507 on 10.6k synthetic trajectories. Three modifications to the v1 data-synthesis pipeline are proposed: (i) enlarging the seed knowledge-graph subgraph used to generate queries, (ii) expanding the available tool set, and (iii) discarding trajectories with fewer than T_min tool calls. The authors report 46.0 / 58.1 / 34.6 / 78.0 on BrowseComp / BrowseComp-ZH / HLE / xbench, claiming SOTA among ~30B ReAct-paradigm agents and surpassing CPT+SFT+RL systems (Tongyi DeepResearch, RedSearcher) at the same scale. The headline conclusion is that careful data curation alone, without RL or CPT, can match or exceed heavier industrial pipelines.
Significance. If the comparative claim holds under matched evaluation conditions, this is a genuinely useful negative result against the necessity of CPT+RL stages for 30B-scale ReAct search agents, and the released weights + 10.6k-trajectory recipe would lower the barrier to entry for academic groups. The paper's strengths to credit explicitly: (a) full open-sourcing of weights and code, (b) a small, auditable training set (10.6k trajectories), (c) explicit acknowledgement and partial mitigation of HF link leakage during evaluation, and (d) a clear, falsifiable headline number that others can re-run. The methodological novelty is modest — the three modifications are individually unsurprising — but the empirical scaling evidence (v1 → v2 with the same recipe) is informative if interpretable.
major comments (4)
- [§2.2 Baselines / Table 1] The central comparative claim relies on baseline numbers 'taken from their technical reports or public leaderboards', while OpenSeeker-v2 itself is run with a 256k context and up to 200 tool calls per trajectory (§2.2 Implementation). BrowseComp and HLE scores are known to scale with browsing budget, and Figure 2 documents that the v2 *training* trajectories average 64.67 tool calls (vs 36.01 for RedSearcher). The 2.6-point BrowseComp gap and the 0.3-point HLE gap over Tongyi DeepResearch / RedSearcher are within the range plausibly attributable to harness, search backend, retry/timeout, and tool-budget differences. Without a matched-harness rerun of at least one CPT+SFT+RL baseline (ideally Tongyi DeepResearch, whose weights are public) inside the same sandbox and tool-call cap as OpenSeeker-v2, the headline 'SFT rivals RL' cannot be cleanly read off Table 1. Please add such a rerun, or
- [§2.1 Methodology] Three modifications are introduced (graph scaling K, tool-set expansion |A|, low-step filter T_min), but no ablation isolates their individual contributions. Only an endpoint v1→v2 comparison is shown. As written, the report cannot distinguish whether gains come from (i) longer/multi-hop queries, (ii) more tool diversity, (iii) the difficulty-floor filter, or (iv) some combination — and in particular cannot rule out that the low-step filter alone (which strongly biases the training distribution toward long-horizon trajectories matching the BrowseComp/HLE evaluation regime) accounts for most of the lift. Please report at minimum three single-modification ablations off a v1 base, plus the values of K, |A|, and T_min used (none are stated numerically in §2.1).
- [§2.2 Benchmarks] Contamination control is described in a single sentence: 'We mask the hugging-face-related links when calling the web search tools to avoid potential leakage.' BrowseComp/BC-ZH/HLE/xbench items are public and indexed on many non-HF domains; the synthesis pipeline itself draws evidence from web tools over a knowledge graph, i.e. from the same evidence ecosystem the benchmarks query. Please report (a) an n-gram or embedding-based overlap audit between the 10.6k synthesized queries and each benchmark's questions/answers, and (b) per-benchmark performance after excluding any synthesized item whose evidence subgraph overlaps a benchmark item's gold sources. Without this, distributional contamination is a live alternative explanation for the magnitude of the gain over v1, particularly the +16.5 jump on BrowseComp.
- [Table 1 / §2.3] The 'first SOTA by a purely academic team using only SFT' framing is load-bearing for the contribution but is not invariant to the comparison set. WebSailor-V2-30B-SFT (24.4 BrowseComp) is cited, but 'SFT-only' upper bounds for several other listed systems are not. Please clarify whether v2's SFT data quantity (10.6k) was held fixed when comparing to v1 (11.7k), and whether the gain attributable to switching base model checkpoints or sampling temperature has been ruled out — the report does not state whether v1 and v2 share the exact same base, optimizer, and inference decoding settings.
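The n-gram overlap audit requested in the contamination comment can be sketched as follows. This is a minimal illustration, assuming word-level n-grams with lowercasing and whitespace tokenization; the function names and the toy strings are hypothetical, and a real audit would need proper tokenization for the Chinese-language benchmark.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Word-level n-grams after lowercasing and whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_overlaps(train_queries: list[str], bench_items: list[str], n: int) -> list[int]:
    """Return indices of training queries sharing at least one n-gram with any benchmark item."""
    bench_grams: set[tuple[str, ...]] = set()
    for item in bench_items:
        bench_grams |= ngrams(item, n)
    return [i for i, q in enumerate(train_queries) if ngrams(q, n) & bench_grams]


# Toy strings, not real benchmark content; n=4 keeps the example short.
train = [
    "which river flows through the capital of the country that hosted the 1992 games",
    "name the composer who taught the author of this treatise",
]
bench = ["Which river flows through the capital of Austria?"]
flagged = flag_overlaps(train, bench, n=4)
print(flagged)
```

Exact n-gram matching only catches verbatim reuse; paraphrased or structurally similar questions would need the embedding-based audit the comment also asks for.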
minor comments (7)
- [Abstract / §1] The abstract describes 'three simple data synthesis modifications' but the prose in §1 then says 'two simple yet highly effective modifications' before listing three. Please reconcile.
- [§2.1] Symbols K, k, |A|, T_min, and the size of D_raw vs D_v2 are introduced abstractly but never assigned numerical values. Please specify the actual hyperparameters used.
- [§2.2 Benchmarks] The text says 'five challenging agentic benchmarks' but only four are listed (BrowseComp, BC-ZH, HLE, xbench).
- [Figure 2] The figure compares *training* tool-call distributions, but the caption ('average tool call counts across search-agent training data') and the surrounding prose slide between training-trajectory length and inference-time behavior. Please add the inference-time tool-call distribution on each benchmark for OpenSeeker-v2 and at least one re-run baseline; this is also directly relevant to the harness-mismatch concern in major comment 1.
- [References] Several arXiv identifiers appear post-dated (e.g. 2602.x, 2603.x, 2605.04036 itself, 2606.x). If this is intentional, a footnote at first occurrence would prevent reader confusion.
- [§2.3] 'OpenSeeker-v2 outperforms ... DeepSeek-V3.1-671B, GLM-4.6-357B, Minimax-M2-230B, Claude-4.5-Sonnet' — these are general-purpose models not necessarily configured as ReAct search agents in their reported numbers. The comparison would be cleaner if framed as 'agentic-benchmark scores under each model's reported configuration' with the caveat stated.
- [Table 1] The '# Samples' column is '?' for most rows, including the CPT+SFT+RL competitors. Where these numbers are public (Tongyi DeepResearch report), please populate; where not, a footnote stating they are undisclosed would be clearer than '?'.
Simulated Author's Rebuttal
We thank the referee for a careful and constructive report. The referee correctly identifies that our headline claim — that SFT-only training on 10.6k high-difficulty trajectories can rival CPT+SFT+RL pipelines at the 30B/ReAct scale — depends critically on (i) matched-harness comparison, (ii) attribution of the gain to specific data-pipeline modifications, (iii) a defensible contamination audit, and (iv) a clean v1→v2 controlled comparison. We agree that the current report under-supports these four pillars and we will revise accordingly. In brief, we will (a) rerun Tongyi DeepResearch (whose weights are public) inside our own sandbox under our 256k context / 200-tool-call budget; (b) add three single-modification ablations off the v1 base, and report the numerical values of K, |A|, and T_min; (c) add an n-gram and embedding overlap audit between the 10.6k synthesized queries and each benchmark, plus a "decontaminated" rerun excluding overlapping items; (d) explicitly state the base checkpoint, optimizer, and decoding settings used for both v1 and v2, and report a v2-recipe-on-v1-base / v1-recipe-on-v2-base cross-check. We acknowledge that until these are in place, the "SFT rivals RL" framing should be presented as a hypothesis supported by the current numbers rather than as a settled comparative claim, and we will soften the abstract and §2.3 wording accordingly. We retain the report's core empirical contribution — the v1→v2 jump under a fixed recipe — as the most defensible
Point-by-point responses
- Referee: Headline 'SFT rivals RL' relies on baseline numbers from technical reports/leaderboards while OpenSeeker-v2 runs with 256k context and up to 200 tool calls; gaps of 2.6 / 0.3 points may be within harness variance. Please add a matched-harness rerun of at least one CPT+SFT+RL baseline (ideally Tongyi DeepResearch).
Authors: We agree this is the central methodological gap. The 2.6-point BrowseComp and 0.3-point HLE margins are indeed within plausible harness/budget variance, and we should not draw the 'SFT rivals RL' conclusion from numbers collected under heterogeneous conditions. For the revision we commit to: (1) rerunning Tongyi DeepResearch (public weights) inside our own sandbox, with the identical search backend, retry/timeout policy, 256k context, and 200-tool-call cap used for OpenSeeker-v2, and reporting the resulting numbers in a new column of Table 1; (2) where feasible, doing the same for WebSailor-V2-30B-RL (also public). We will additionally report OpenSeeker-v2 under a reduced tool-call cap (e.g., 64) matching the average trajectory length of the comparison systems' training data, so readers can see how much of the margin survives a tighter budget. If the matched-harness rerun shrinks the BrowseComp/HLE gaps to within noise, we will explicitly retract the 'surpassing CPT+SFT+RL' framing in the abstract and §2.3 and replace it with the weaker, defensible claim that SFT-only training is competitive with CPT+SFT+RL at this scale on BC-ZH and xbench, where the margins (11.4 and 3.0 points) are large enough to plausibly survive harness normalization. revision: yes
- Referee: No ablation isolates the contributions of graph scaling K, tool-set expansion |A|, or low-step filter T_min; numerical values of K, |A|, T_min are not stated. The low-step filter alone may explain most of the lift by biasing toward long-horizon trajectories matching BrowseComp/HLE.
Authors: The referee is correct that the report only shows endpoint v1→v2. We will add to §2.1: (a) the numerical values used (K, |A|, T_min, plus the v1 counterparts), which are currently in our codebase but were elided from the report; (b) three single-modification ablations off the v1 base — v1+K only, v1+|A| only, v1+T_min only — each trained for the same number of steps as v2, evaluated on all four benchmarks; and (c) the leave-one-out counterparts (v2 minus each modification). We agree that the low-step filter is the most likely single dominant factor, precisely because it shifts the training distribution toward long-horizon trajectories that resemble the BrowseComp/HLE regime, and we will state this hypothesis explicitly and let the ablation adjudicate it. If T_min alone reproduces most of the gain, this is itself an interesting and publishable finding — but it changes the narrative from 'three complementary modifications' to 'difficulty-floor filtering is what matters,' and we will rewrite §2.1 and §3 accordingly. revision: yes
- Referee: Contamination control is one sentence (masking HF links). Benchmarks are indexed on many non-HF domains, and the synthesis pipeline draws from the same web evidence ecosystem. Please add (a) n-gram/embedding overlap audit between 10.6k synthesized queries and each benchmark, and (b) a per-benchmark rerun excluding synthesized items whose evidence subgraph overlaps benchmark gold sources.
Authors: We accept this concern in full. HF-link masking addresses only one well-known leakage channel and does not bound distributional contamination. For the revision we will add an Appendix containing: (1) a 13-gram exact-match audit between every synthesized query/answer and the public questions/answers of BrowseComp, BC-ZH, HLE, and xbench-DeepSearch; (2) an embedding-based nearest-neighbor audit (using a strong sentence encoder) reporting the distribution of cosine similarities and a flagged set above a conservative threshold; (3) a URL/domain-level overlap audit between the evidence subgraphs used to synthesize each training item and the gold-source URLs published with the benchmarks where available; and (4) a 'decontaminated' retraining of OpenSeeker-v2 with all flagged items removed, with the four benchmark numbers reported alongside the original. We agree that, until these results are presented, distributional contamination remains a live alternative explanation for the +16.5 BrowseComp jump from v1, and we will say so explicitly in §2.3. revision: yes
- Referee: Clarify whether v2 SFT data quantity (10.6k) was held fixed against v1 (11.7k), and whether gains attributable to base-model checkpoint or sampling temperature have been ruled out. The report does not state whether v1 and v2 share base, optimizer, and inference decoding settings.
Authors: Thank you — these details should have been in the report. v2 is instantiated from Qwen3-30B-A3B-Thinking-2507; we will state explicitly in the revision whether v1 used the identical base checkpoint (it did, in our internal pipeline) and report the full optimizer settings (lr, schedule, batch size, epochs) and inference decoding (temperature, top-p, max output, tool-call cap) used for both. The data quantities (11.7k vs 10.6k) were not held fixed; we will add a controlled comparison at matched 10.6k by subsampling v1 to 10.6k and retraining, and conversely at 11.7k by extending v2 to 11.7k, so that quantity is removed as a confound. We will also rerun v1 evaluation with v2's exact inference settings to rule out a decoding-level shift. Regarding the 'first SOTA by a purely academic team using only SFT' framing: we will retain it only if it survives the matched-harness rerun (Major 1) and the decontamination rerun (Major 3); otherwise we will weaken it to 'a strong SFT-only academic baseline' and let readers judge. revision: yes
- We cannot, within the rebuttal window, guarantee that the matched-harness rerun of Tongyi DeepResearch will preserve the 2.6-point BrowseComp and 0.3-point HLE margins; if it does not, the 'surpassing CPT+SFT+RL' claim in the abstract will need to be withdrawn rather than defended.
- We cannot rule out a priori that the low-step filter T_min is the single dominant factor and that the graph-scaling and tool-set-expansion modifications are largely incidental; the planned ablation may reduce the methodological contribution from 'three modifications' to 'one filter heuristic.'
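The ablation runs committed to in the rebuttal (three single-modification runs off the v1 base, plus three leave-one-out runs off v2) can be enumerated mechanically. This is a sketch only; the modification labels and run names are hypothetical and do not come from the authors' codebase.

```python
from itertools import product

# Hypothetical labels for the three data-synthesis modifications.
MODIFICATIONS = ("graph_scaling_K", "tool_set_A", "low_step_T_min")


def ablation_grid() -> dict[str, dict[str, bool]]:
    """Enumerate all on/off combinations and label each run."""
    grid = {}
    for flags in product([False, True], repeat=len(MODIFICATIONS)):
        config = dict(zip(MODIFICATIONS, flags))
        enabled = [m for m in MODIFICATIONS if config[m]]
        if not enabled:
            label = "v1_base"
        elif len(enabled) == 1:
            label = f"v1+{enabled[0]}"
        elif len(enabled) == len(MODIFICATIONS):
            label = "v2_full"
        else:
            # Exactly one modification is switched off: a leave-one-out run.
            missing = next(m for m in MODIFICATIONS if not config[m])
            label = f"v2-{missing}"
        grid[label] = config
    return grid


grid = ablation_grid()
for label, config in grid.items():
    print(label, config)
```

Three binary switches give eight runs in total: the two endpoints already reported, plus the six ablations that would separate the filter's contribution from the other two modifications.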
Circularity Check
No significant circularity: an empirical SFT report whose claims are externally benchmarked; the real risks (harness mismatch, KG↔benchmark overlap) are correctness/comparability concerns, not circular derivations.
full rationale
This is a short technical report whose central claim is empirical: a 30B model SFT-trained on 10.6k synthesized trajectories scores 46.0/58.1/34.6/78.0 on four public benchmarks (BrowseComp, BrowseComp-ZH, HLE, xbench). The "derivation" chain is not a chain of theoretical equations being reduced to themselves; it is (a) a data-construction pipeline (graph expansion, tool-set expansion, low-step filter) and (b) reported scores on external, third-party benchmarks. None of the four reported numbers is fitted, renamed, or back-defined from a prior input — they are evaluations against datasets the authors did not author (Wei et al. 2025; Zhou et al. 2025; Phan et al. 2025; xbench team). That makes them externally falsifiable in the sense of Hard Rule 4: a third party can rerun the open-sourced weights against the public benchmarks. The self-citations that exist (Du et al. 2026 = OpenSeeker-v1; Ye et al. 2025) are used additively (v1 is the prior baseline that v2 improves over; AgentFold is mentioned as an alternative paradigm), not as load-bearing uniqueness theorems or imported ansätze. The "Methodology" equations (G_sub^(K) = Expand(...), q ~ P_gen(q | G_sub^(K)), τ = (r1,a1,o1,...), D_v2 = {(q,τ) : T(τ) ≥ T_min}) are notation, not derivations — nothing is "predicted" from them. The reader's and skeptic's critiques (contamination via KG/web overlap; harness mismatch in tool-call budget when comparing to baseline numbers "taken from their technical reports or public leaderboards") are real and important, but they are correctness, comparability, and evaluation-protocol concerns, not circularity as defined here. There is no step where a fitted quantity is renamed as a prediction, no self-definitional loop, and no uniqueness claim imported from the authors' own prior work. Score: 1 (one minor self-citation to OpenSeeker-v1, not load-bearing for any forced conclusion).