OpenSeeker-v2: Pushing the Limits of Search Agents with Informative and High-Difficulty Trajectories

Keduan Huang; Rui Ye; Shuo Tang; Siheng Chen; Xinyu Zhu; Yuwen Du; Yuzhu Cai

arxiv: 2605.04036 · v1 · submitted 2026-05-05 · 💻 cs.AI · cs.CL

Yuwen Du, Rui Ye, Shuo Tang, Keduan Huang, Xinyu Zhu, Yuzhu Cai, Siheng Chen

Pith reviewed 2026-05-06 05:14 UTC · model claude-opus-4-7

keywords: search agents · supervised fine-tuning · ReAct · synthetic trajectories · knowledge graph data synthesis · long-horizon reasoning · deep research benchmarks · tool use

The pith

A 30B search agent trained with supervised fine-tuning alone on 10.6k carefully filtered trajectories matches or beats 30B agents built with the full continual-pretraining plus reinforcement-learning pipeline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether the multi-stage training recipe used by industrial labs to build deep-search agents (continual pre-training, supervised fine-tuning, and reinforcement learning) is actually necessary, or whether the heavy lifting is being done by data quality. The authors answer in favor of data: with three changes to a synthetic-trajectory pipeline, a single SFT pass on a 30B model produces an agent that outperforms 30B competitors trained with the full pipeline on four agentic benchmarks. The three changes are scaling the source knowledge graph so questions require deeper multi-hop reasoning, expanding the available tool set so the agent sees more interaction patterns, and filtering out any synthesized trajectory that reaches its answer in fewer than a minimum number of tool calls. The resulting training set is small (10.6k trajectories) but skews long: an average of 64.7 tool-call steps per trajectory, versus 47.0 for OpenSeeker-v1 and 36.0 for RedSearcher. A sympathetic reader should care because, if the result holds up, it shifts the bottleneck in agent training from compute-heavy RL stages to the curation of difficult, long-horizon synthetic data, something academic groups can plausibly do.

Core claim

The authors argue that the heavy industrial recipe for training deep-search agents — continual pre-training, supervised fine-tuning, and reinforcement learning stacked together — is not actually required to reach state-of-the-art at the 30B scale. Three targeted changes to the synthetic training data are enough: enlarge the knowledge-graph neighborhood used to seed each question so that answers genuinely require multi-hop evidence; broaden the tool set so agents learn diverse interaction patterns; and discard any trajectory that solves in too few tool calls, enforcing a minimum-difficulty floor. With only 10.6k such trajectories and pure SFT on a 30B Qwen3 thinking model, the resulting agent

What carries the argument

A synthetic trajectory pipeline with three knobs: (1) a graph-expansion budget K that grows the evidence subgraph around each seed node so generated questions structurally demand multi-hop aggregation; (2) an enlarged action set A from which ReAct trajectories are drawn; (3) a hard threshold T_min on tool-call count that discards shallow trajectories before SFT. The agent itself is a ReAct loop over Qwen3-30B-A3B-Thinking-2507 with a 256k context and up to 200 tool calls.
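The third knob is the easiest to make concrete. The following is a minimal sketch of a low-step filter over ReAct trajectories, assuming each trajectory is a list of (reasoning, action, observation) steps ending in a final answer. The class names, the `"answer"` action label, and the `t_min=5` value are all hypothetical; the paper does not state its T_min.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    reasoning: str    # r_t: the model's thought before acting
    action: str       # a_t: tool name, or "answer" for the final step (naming is assumed)
    observation: str  # o_t: what the tool returned


@dataclass
class Trajectory:
    question: str
    steps: List[Step]

    def tool_calls(self) -> int:
        # T(tau): count steps that invoke a tool, i.e. everything except the final answer.
        return sum(1 for s in self.steps if s.action != "answer")


def low_step_filter(raw: List[Trajectory], t_min: int) -> List[Trajectory]:
    # Keep only trajectories with at least t_min tool calls.
    return [t for t in raw if t.tool_calls() >= t_min]


def mean_tool_calls(data: List[Trajectory]) -> float:
    # Average tool-call count, the difficulty proxy plotted in Figure 2.
    return sum(t.tool_calls() for t in data) / len(data)


def demo_trajectory(n_tools: int) -> Trajectory:
    # Hypothetical toy trajectory: n_tools search calls followed by one answer step.
    steps = [Step("think", "search", "result") for _ in range(n_tools)]
    steps.append(Step("done", "answer", ""))
    return Trajectory("toy question", steps)


raw = [demo_trajectory(n) for n in (2, 5, 9)]
kept = low_step_filter(raw, t_min=5)  # t_min=5 is a placeholder, not the paper's value
print(len(kept), mean_tool_calls(kept))
```

Raising `t_min` shrinks the dataset and raises the mean tool-call count at the same time, which is the trade-off the paper accepts.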

If this is right

State-of-the-art deep-search behavior at 30B does not require reinforcement learning, only sufficiently long and difficult supervised trajectories.
Trajectory length distribution (average tool-call count) is a usable proxy for training-data difficulty when curating search-agent corpora.
Filtering out short-trajectory examples — even at the cost of dataset size — improves long-horizon search performance more than adding easy data.
Academic-scale teams can produce competitive frontier search agents if they invest in graph-driven question synthesis rather than RL infrastructure.
Knowledge-graph expansion budget and tool-set size are tunable levers for question difficulty in synthetic agent data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The 64.7-step average trajectory length suggests the model is being taught a particular search rhythm; benchmarks that reward shorter solutions (or penalize tool-call count) would likely show a smaller margin or reversal.
Because only the v1→v2 endpoints are compared, the relative contributions of graph scaling, tool-set expansion, and low-step filtering are entangled; the low-step filter alone could plausibly account for most of the gain, since it is the only modification that directly raises the difficulty floor of every retained example.
If the underlying base model (Qwen3-30B-A3B-Thinking-2507) already encodes much of the reasoning capacity, the result may generalize less well to base models without a strong native thinking mode, making the recipe partly a story about pairing difficult SFT data with thinking-tuned bases.
The approach implies a natural curriculum extension: progressively raise T_min during training rather than applying a single floor, which could push performance further without growing dataset size.

Load-bearing premise

That the benchmark gains reflect the agent learning to search and reason in general, rather than the synthetic training questions being drawn from the same kinds of web sources and structures the benchmarks themselves probe.

What would settle it

Run a contamination and distribution audit: rebuild the 10.6k trajectories from a knowledge graph seeded only from sources demonstrably disjoint from BrowseComp, BrowseComp-ZH, HLE, and xbench evidence, and retrain. If the four-benchmark scores fall back to the v1 range (around 30 on BrowseComp, 48 on BC-ZH, 74 on xbench), the gains were dataset-similarity rather than data-difficulty. A second falsifier: ablate each of the three modifications separately; if no single one moves the needle, the v1→v2 jump must be attributed to something else.

Figures

Figures reproduced from arXiv: 2605.04036 by Keduan Huang, Rui Ye, Shuo Tang, Siheng Chen, Xinyu Zhu, Yuwen Du, Yuzhu Cai.

**Figure 1.** OpenSeeker-v2 achieves state-of-the-art performance within its model scale and paradigm on …

**Figure 2.** Comparison of average tool call counts across search-agent training data. OpenSeeker-v2 demonstrates higher data difficulty than prior counterparts. OpenSeeker-v2 is built upon substantially longer search trajectories, with an average of 64.67 steps per trajectory, compared with 46.97 for OpenSeeker-v1 and 36.01 for RedSearcher. This suggests that the OpenSeeker-v2 training data requires more complex mult…

Original abstract

Deep search capabilities have become an indispensable competency for frontier Large Language Model (LLM) agents, yet their development remains dominated by industrial giants. The typical industry recipe involves a highly resource-intensive pipeline spanning pre-training, continual pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning (RL). In this report, we show that when fueled with informative and high-difficulty trajectories, a simple SFT approach could be surprisingly powerful for training frontier search agents. By introducing three simple data synthesis modifications: scaling knowledge graph size for richer exploration, expanding the tool set size for broader functionality, and strict low-step filtering, we establish a stronger baseline. Trained on merely 10.6k data points, our OpenSeeker-v2 achieves state-of-the-art performance across 4 benchmarks (30B-sized agents with ReAct paradigm): 46.0% on BrowseComp, 58.1% on BrowseComp-ZH, 34.6% on Humanity's Last Exam, and 78.0% on xbench, surpassing even Tongyi DeepResearch trained with heavy CPT+SFT+RL pipeline, which achieves 43.4%, 46.7%, 32.9%, and 75.0%, respectively. Notably, OpenSeeker-v2 represents the first state-of-the-art search agent within its model scale and paradigm to be developed by a purely academic team using only SFT. We are excited to open-source the OpenSeeker-v2 model weights and share our simple yet effective findings to make frontier search agent research more accessible to the community.
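As a quick arithmetic check, the per-benchmark margins over Tongyi DeepResearch can be recomputed from the eight scores quoted in the abstract. This is only a sketch of the subtraction; the numbers are the reported ones, not matched-harness measurements, and the variable names are illustrative.

```python
# Scores (%) exactly as quoted in the abstract.
BENCHMARKS = ["BrowseComp", "BrowseComp-ZH", "HLE", "xbench"]
OPENSEEKER_V2 = [46.0, 58.1, 34.6, 78.0]
TONGYI_DR = [43.4, 46.7, 32.9, 75.0]


def margins(ours, theirs):
    # Point difference per benchmark, rounded to one decimal to hide float noise.
    return [round(a - b, 1) for a, b in zip(ours, theirs)]


for name, gap in zip(BENCHMARKS, margins(OPENSEEKER_V2, TONGYI_DR)):
    print(f"{name}: +{gap} points")
```

The BrowseComp-ZH margin is several times larger than the other three, which is why it carries most of the weight in the comparison.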

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Useful engineering report with verifiable weights, but the headline "SFT rivals RL" comparison is not made under controlled conditions.

Quick read on this one. It's a short technical report from the SJTU group announcing OpenSeeker-v2: a Qwen3-30B-A3B checkpoint SFT-trained on 10.6k synthetic ReAct trajectories that posts the highest reported numbers among ~30B ReAct search agents on BrowseComp (46.0), BC-ZH (58.1), HLE (34.6), and xbench (78.0). Weights are on HF. That last bit matters — the headline numbers are third-party verifiable, which is more than most entries in Table 1 can say.

What's actually new: not the methodology. The three "modifications" (bigger seed subgraph, larger tool set, low-step trajectory filtering) are all standard moves from WebSailor/RedSearcher/Tongyi DR, and the paper concedes as much. What's new is the artifact and the empirical claim that with sufficiently long, hard synthetic trajectories you can SFT your way to numbers that match or beat CPT+SFT+RL pipelines at this scale. The 11.4-point BC-ZH margin over Tongyi is large enough to be interesting on its face. Releasing the weights is the right move and earns them credit.

Soft spots, in proportion:

1. The cross-team comparison is not apples-to-apples. Baselines are lifted from other groups' tech reports. OpenSeeker-v2 runs with a 256k context and up to 200 tool calls per trajectory, and Figure 2 brags that training trajectories average 64.67 steps vs RedSearcher's 36. If inference inherits that profile, the 2.6-point BrowseComp gap and the 0.3-point HLE gap are well within what extra browsing budget alone buys you. The stress-test reader is right that the smaller margins do not survive harness-mismatch scrutiny. The BC-ZH and xbench gaps probably do.

2. No ablation. K, |A|, and T_min are not even reported as numbers. We get a v1→v2 endpoint comparison and that's it. So we cannot say which of the three "simple modifications" is doing the work, or whether it's just longer trajectories.

3. Contamination audit is one sentence about masking HF links. The KG-derived synthesis pipeline draws from the same web sources the benchmarks draw from. Not damning, but worth a paragraph.

4. No error bars, no multiple seeds.

None of this is load-bearing dishonesty — the paper is fairly transparent about being a short report and about where baselines come from. It just oversells the methodological conclusion.

Recommendation: worth engaging with. Send to peer review with a request for a matched-harness rerun against at least Tongyi DR and RedSearcher, plus a basic ablation. The artifact is real; the framing needs tightening.
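On point 4, a rough way to see why small gaps need error bars: treat each benchmark question as an independent pass/fail trial and ask how many standard errors separate two accuracies. This is a simplified sketch that ignores seed variance and harness effects, and `N_ITEMS = 1000` is a hypothetical placeholder because the benchmark sizes are not stated here.

```python
import math


def diff_standard_error(acc_a: float, acc_b: float, n_items: int) -> float:
    # Standard error of the difference between two independent binomial accuracies.
    var_a = acc_a * (1 - acc_a) / n_items
    var_b = acc_b * (1 - acc_b) / n_items
    return math.sqrt(var_a + var_b)


def gap_in_se_units(acc_a: float, acc_b: float, n_items: int) -> float:
    # How many standard errors separate the two scores.
    return (acc_a - acc_b) / diff_standard_error(acc_a, acc_b, n_items)


N_ITEMS = 1000  # hypothetical benchmark size, for illustration only
z = gap_in_se_units(0.460, 0.434, N_ITEMS)  # reported BrowseComp scores as fractions
print(f"gap = {z:.2f} standard errors at n = {N_ITEMS}")
```

Under these assumptions a 2.6-point gap sits well under two standard errors, so even before any harness mismatch it would not be distinguishable from sampling noise without more items or repeated runs.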

Referee Report

4 major / 7 minor

Summary. The report introduces OpenSeeker-v2, a 30B-parameter ReAct search agent obtained by supervised fine-tuning of Qwen3-30B-A3B-Thinking-2507 on 10.6k synthetic trajectories. Three modifications to the v1 data-synthesis pipeline are proposed: (i) enlarging the seed knowledge-graph subgraph used to generate queries, (ii) expanding the available tool set, and (iii) discarding trajectories with fewer than T_min tool calls. The authors report 46.0 / 58.1 / 34.6 / 78.0 on BrowseComp / BrowseComp-ZH / HLE / xbench, claiming SOTA among ~30B ReAct-paradigm agents and surpassing CPT+SFT+RL systems (Tongyi DeepResearch, RedSearcher) at the same scale. The headline conclusion is that careful data curation alone, without RL or CPT, can match or exceed heavier industrial pipelines.

Significance. If the comparative claim holds under matched evaluation conditions, this is a genuinely useful negative result against the necessity of CPT+RL stages for 30B-scale ReAct search agents, and the released weights + 10.6k-trajectory recipe would lower the barrier to entry for academic groups. The paper's strengths to credit explicitly: (a) full open-sourcing of weights and code, (b) a small, auditable training set (10.6k trajectories), (c) explicit acknowledgement and partial mitigation of HF link leakage during evaluation, and (d) a clear, falsifiable headline number that others can re-run. The methodological novelty is modest — the three modifications are individually unsurprising — but the empirical scaling evidence (v1 → v2 with the same recipe) is informative if interpretable.

major comments (4)

[§2.2 Baselines / Table 1] The central comparative claim relies on baseline numbers 'taken from their technical reports or public leaderboards', while OpenSeeker-v2 itself is run with a 256k context and up to 200 tool calls per trajectory (§2.2 Implementation). BrowseComp and HLE scores are known to scale with browsing budget, and Figure 2 documents that the v2 *training* trajectories average 64.67 tool calls (vs 36.01 for RedSearcher). The 2.6-point BrowseComp gap and the 0.3-point HLE gap over Tongyi DeepResearch / RedSearcher are within the range plausibly attributable to harness, search backend, retry/timeout, and tool-budget differences. Without a matched-harness rerun of at least one CPT+SFT+RL baseline (ideally Tongyi DeepResearch, whose weights are public) inside the same sandbox and tool-call cap as OpenSeeker-v2, the headline 'SFT rivals RL' cannot be cleanly read off Table 1. Please add such a rerun, or
[§2.1 Methodology] Three modifications are introduced (graph scaling K, tool-set expansion |A|, low-step filter T_min), but no ablation isolates their individual contributions. Only an endpoint v1→v2 comparison is shown. As written, the report cannot distinguish whether gains come from (i) longer/multi-hop queries, (ii) more tool diversity, (iii) the difficulty-floor filter, or (iv) some combination — and in particular cannot rule out that the low-step filter alone (which strongly biases the training distribution toward long-horizon trajectories matching the BrowseComp/HLE evaluation regime) accounts for most of the lift. Please report at minimum three single-modification ablations off a v1 base, plus the values of K, |A|, and T_min used (none are stated numerically in §2.1).
[§2.2 Benchmarks] Contamination control is described in a single sentence: 'We mask the hugging-face-related links when calling the web search tools to avoid potential leakage.' BrowseComp/BC-ZH/HLE/xbench items are public and indexed on many non-HF domains; the synthesis pipeline itself draws evidence from web tools over a knowledge graph, i.e. from the same evidence ecosystem the benchmarks query. Please report (a) an n-gram or embedding-based overlap audit between the 10.6k synthesized queries and each benchmark's questions/answers, and (b) per-benchmark performance after excluding any synthesized item whose evidence subgraph overlaps a benchmark item's gold sources. Without this, distributional contamination is a live alternative explanation for the magnitude of the gain over v1, particularly the +16.5 jump on BrowseComp.
[Table 1 / §2.3] The 'first SOTA by a purely academic team using only SFT' framing is load-bearing for the contribution but is not invariant to the comparison set. WebSailor-V2-30B-SFT (24.4 BrowseComp) is cited, but 'SFT-only' upper bounds for several other listed systems are not. Please clarify whether v2's SFT data quantity (10.6k) was held fixed when comparing to v1 (11.7k), and whether the gain attributable to switching base model checkpoints or sampling temperature has been ruled out — the report does not state whether v1 and v2 share the exact same base, optimizer, and inference decoding settings.
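The n-gram overlap audit requested in the contamination comment can be sketched in a few lines. This is a minimal illustration with whitespace-style word tokenization and toy strings; the function names are hypothetical, `n` is a free parameter, and a character-level variant would be needed for Chinese benchmarks such as BrowseComp-ZH.

```python
import re
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    # Lowercased word n-grams; texts shorter than n tokens yield an empty set.
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_overlaps(synth: List[str], bench: List[str], n: int) -> List[int]:
    # Indices of synthesized queries sharing at least one n-gram with any benchmark item.
    bench_grams: Set[Tuple[str, ...]] = set()
    for item in bench:
        bench_grams |= ngrams(item, n)
    return [i for i, q in enumerate(synth) if ngrams(q, n) & bench_grams]


# Toy strings for illustration only; not drawn from any real dataset.
synth = [
    "Which river flows through the city where the composer was born",
    "Name the tallest mountain in the southern range",
]
bench = ["Which river flows through the city of the composer"]
print(flag_overlaps(synth, bench, n=3))
```

Exact n-gram matching only catches near-verbatim leakage; the embedding-based audit and the evidence-source overlap check asked for in the same comment address paraphrased or structurally similar items.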

minor comments (7)

[Abstract / §1] The abstract describes 'three simple data synthesis modifications' but the prose in §1 then says 'two simple yet highly effective modifications' before listing three. Please reconcile.
[§2.1] Symbols K, k, |A|, T_min, and the size of D_raw vs D_v2 are introduced abstractly but never assigned numerical values. Please specify the actual hyperparameters used.
[§2.2 Benchmarks] The text says 'five challenging agentic benchmarks' but only four are listed (BrowseComp, BC-ZH, HLE, xbench).
[Figure 2] The figure compares *training* tool-call distributions, but the caption ('average tool call counts across search-agent training data') and the surrounding prose slide between training-trajectory length and inference-time behavior. Please add the inference-time tool-call distribution on each benchmark for OpenSeeker-v2 and at least one re-run baseline; this is also directly relevant to the harness-mismatch concern in major comment 1.
[References] Several arXiv identifiers appear post-dated (e.g. 2602.x, 2603.x, 2605.04036 itself, 2606.x). If this is intentional, a footnote at first occurrence would prevent reader confusion.
[§2.3] 'OpenSeeker-v2 outperforms ... DeepSeek-V3.1-671B, GLM-4.6-357B, Minimax-M2-230B, Claude-4.5-Sonnet' — these are general-purpose models not necessarily configured as ReAct search agents in their reported numbers. The comparison would be cleaner if framed as 'agentic-benchmark scores under each model's reported configuration' with the caveat stated.
[Table 1] The '# Samples' column is '?' for most rows, including the CPT+SFT+RL competitors. Where these numbers are public (Tongyi DeepResearch report), please populate; where not, a footnote stating they are undisclosed would be clearer than '?'.

Simulated Author's Rebuttal

4 responses · 2 unresolved

We thank the referee for a careful and constructive report. The referee correctly identifies that our headline claim — that SFT-only training on 10.6k high-difficulty trajectories can rival CPT+SFT+RL pipelines at the 30B/ReAct scale — depends critically on (i) matched-harness comparison, (ii) attribution of the gain to specific data-pipeline modifications, (iii) a defensible contamination audit, and (iv) a clean v1→v2 controlled comparison. We agree that the current report under-supports these four pillars and we will revise accordingly. In brief, we will (a) rerun Tongyi DeepResearch (whose weights are public) inside our own sandbox under our 256k context / 200-tool-call budget; (b) add three single-modification ablations off the v1 base, and report the numerical values of K, |A|, and T_min; (c) add an n-gram and embedding overlap audit between the 10.6k synthesized queries and each benchmark, plus a "decontaminated" rerun excluding overlapping items; (d) explicitly state the base checkpoint, optimizer, and decoding settings used for both v1 and v2, and report a v2-recipe-on-v1-base / v1-recipe-on-v2-base cross-check. We acknowledge that until these are in place, the "SFT rivals RL" framing should be presented as a hypothesis supported by the current numbers rather than as a settled comparative claim, and we will soften the abstract and §2.3 wording accordingly. We retain the report's core empirical contribution — the v1→v2 jump under a fixed recipe — as the most defensible

Referee: Headline 'SFT rivals RL' relies on baseline numbers from technical reports/leaderboards while OpenSeeker-v2 runs with 256k context and up to 200 tool calls; gaps of 2.6 / 0.3 points may be within harness variance. Please add a matched-harness rerun of at least one CPT+SFT+RL baseline (ideally Tongyi DeepResearch).

Authors: We agree this is the central methodological gap. The 2.6-point BrowseComp and 0.3-point HLE margins are indeed within plausible harness/budget variance, and we should not draw the 'SFT rivals RL' conclusion from numbers collected under heterogeneous conditions. For the revision we commit to: (1) rerunning Tongyi DeepResearch (public weights) inside our own sandbox, with the identical search backend, retry/timeout policy, 256k context, and 200-tool-call cap used for OpenSeeker-v2, and reporting the resulting numbers in a new column of Table 1; (2) where feasible, doing the same for WebSailor-V2-30B-RL (also public). We will additionally report OpenSeeker-v2 under a reduced tool-call cap (e.g., 36 or 47, matching the average training-trajectory lengths that Figure 2 reports for RedSearcher and OpenSeeker-v1), so readers can see how much of the margin survives a tighter budget. If the matched-harness rerun shrinks the BrowseComp/HLE gaps to within noise, we will explicitly retract the 'surpassing CPT+SFT+RL' framing in the abstract and §2.3 and replace it with the weaker, defensible claim that SFT-only training is competitive with CPT+SFT+RL at this scale on BC-ZH and xbench, where the margins (11.4 and 3.0 points) are large enough to plausibly survive harness normalization. revision: yes
Referee: No ablation isolates the contributions of graph scaling K, tool-set expansion |A|, or low-step filter T_min; numerical values of K, |A|, T_min are not stated. The low-step filter alone may explain most of the lift by biasing toward long-horizon trajectories matching BrowseComp/HLE.

Authors: The referee is correct that the report only shows endpoint v1→v2. We will add to §2.1: (a) the numerical values used (K, |A|, T_min, plus the v1 counterparts), which are currently in our codebase but were elided from the report; (b) three single-modification ablations off the v1 base — v1+K only, v1+|A| only, v1+T_min only — each trained for the same number of steps as v2, evaluated on all four benchmarks; and (c) the leave-one-out counterparts (v2 minus each modification). We agree that the low-step filter is the most likely single dominant factor, precisely because it shifts the training distribution toward long-horizon trajectories that resemble the BrowseComp/HLE regime, and we will state this hypothesis explicitly and let the ablation adjudicate it. If T_min alone reproduces most of the gain, this is itself an interesting and publishable finding — but it changes the narrative from 'three complementary modifications' to 'difficulty-floor filtering is what matters,' and we will rewrite §2.1 and §3 accordingly. revision: yes
Referee: Contamination control is one sentence (masking HF links). Benchmarks are indexed on many non-HF domains, and the synthesis pipeline draws from the same web evidence ecosystem. Please add (a) n-gram/embedding overlap audit between 10.6k synthesized queries and each benchmark, and (b) a per-benchmark rerun excluding synthesized items whose evidence subgraph overlaps benchmark gold sources.

Authors: We accept this concern in full. HF-link masking addresses only one well-known leakage channel and does not bound distributional contamination. For the revision we will add an Appendix containing: (1) a 13-gram exact-match audit between every synthesized query/answer and the public questions/answers of BrowseComp, BC-ZH, HLE, and xbench-DeepSearch; (2) an embedding-based nearest-neighbor audit (using a strong sentence encoder) reporting the distribution of cosine similarities and a flagged set above a conservative threshold; (3) a URL/domain-level overlap audit between the evidence subgraphs used to synthesize each training item and the gold-source URLs published with the benchmarks where available; and (4) a 'decontaminated' retraining of OpenSeeker-v2 with all flagged items removed, with the four benchmark numbers reported alongside the original. We agree that, until these results are presented, distributional contamination remains a live alternative explanation for the +16.5 BrowseComp jump from v1, and we will say so explicitly in §2.3. revision: yes
Referee: Clarify whether v2 SFT data quantity (10.6k) was held fixed against v1 (11.7k), and whether gains attributable to base-model checkpoint or sampling temperature have been ruled out. The report does not state whether v1 and v2 share base, optimizer, and inference decoding settings.

Authors: Thank you — these details should have been in the report. v2 is instantiated from Qwen3-30B-A3B-Thinking-2507; we will state explicitly in the revision whether v1 used the identical base checkpoint (it did, in our internal pipeline) and report the full optimizer settings (lr, schedule, batch size, epochs) and inference decoding (temperature, top-p, max output, tool-call cap) used for both. The data quantities (11.7k vs 10.6k) were not held fixed; we will add a controlled comparison at matched 10.6k by subsampling v1 to 10.6k and retraining, and conversely at 11.7k by extending v2 to 11.7k, so that quantity is removed as a confound. We will also rerun v1 evaluation with v2's exact inference settings to rule out a decoding-level shift. Regarding the 'first SOTA by a purely academic team using only SFT' framing: we will retain it only if it survives the matched-harness rerun (Major 1) and the decontamination rerun (Major 3); otherwise we will weaken it to 'a strong SFT-only academic baseline' and let readers judge. revision: yes
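The ablation plan in the second response (three single-modification runs and three leave-one-out runs, alongside the v1 and v2 endpoints) can be enumerated as below. This is a sketch of the run list only; the modification labels and config names are hypothetical, and no training or numeric hyperparameters are implied.

```python
MODS = ("graph_scaling_K", "tool_set_A", "low_step_filter_Tmin")


def ablation_grid() -> dict:
    # Map each run name to the set of modifications switched on for that run.
    configs = {"v1_base": frozenset(), "v2_full": frozenset(MODS)}
    for mod in MODS:
        configs[f"v1+{mod}"] = frozenset({mod})        # single-modification run
        configs[f"v2-{mod}"] = frozenset(MODS) - {mod}  # leave-one-out run
    return configs


grid = ablation_grid()
for name, enabled in grid.items():
    print(name, sorted(enabled))
```

With three binary modifications, these eight runs are exactly the full 2^3 factorial, so the plan would expose pairwise interactions as well as each modification's individual effect.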

standing simulated objections not resolved

We cannot, within the rebuttal window, guarantee that the matched-harness rerun of Tongyi DeepResearch will preserve the 2.6-point BrowseComp and 0.3-point HLE margins; if it does not, the 'surpassing CPT+SFT+RL' claim in the abstract will need to be withdrawn rather than defended.
We cannot rule out a priori that the low-step filter T_min is the single dominant factor and that the graph-scaling and tool-set-expansion modifications are largely incidental; the planned ablation may reduce the methodological contribution from 'three modifications' to 'one filter heuristic.'

Circularity Check

0 steps flagged

No significant circularity: an empirical SFT report whose claims are externally benchmarked; the real risks (harness mismatch, KG↔benchmark overlap) are correctness/comparability concerns, not circular derivations.

This is a short technical report whose central claim is empirical: a 30B model SFT-trained on 10.6k synthesized trajectories scores 46.0/58.1/34.6/78.0 on four public benchmarks (BrowseComp, BrowseComp-ZH, HLE, xbench). The "derivation" chain is not a chain of theoretical equations being reduced to themselves; it is (a) a data-construction pipeline (graph expansion, tool-set expansion, low-step filter) and (b) reported scores on external, third-party benchmarks. None of the four reported numbers is fitted, renamed, or back-defined from a prior input — they are evaluations against datasets the authors did not author (Wei et al. 2025; Zhou et al. 2025; Phan et al. 2025; xbench team). That makes them externally falsifiable in the sense of Hard Rule 4: a third party can rerun the open-sourced weights against the public benchmarks. The self-citations that exist (Du et al. 2026 = OpenSeeker-v1; Ye et al. 2025) are used additively (v1 is the prior baseline that v2 improves over; AgentFold is mentioned as an alternative paradigm), not as load-bearing uniqueness theorems or imported ansätze. The "Methodology" equations (G_sub^(K) = Expand(...), q ~ P_gen(q | G_sub^(K)), τ = (r1,a1,o1,...), D_v2 = {(q,τ) : T(τ) ≥ T_min}) are notation, not derivations — nothing is "predicted" from them. The reader's and skeptic's critiques (contamination via KG/web overlap; harness mismatch in tool-call budget when comparing to baseline numbers "taken from their technical reports or public leaderboards") are real and important, but they are correctness, comparability, and evaluation-protocol concerns, not circularity as defined here. There is no step where a fitted quantity is renamed as a prediction, no self-definitional loop, and no uniqueness claim imported from the authors' own prior work. Score: 1 (one minor self-citation to OpenSeeker-v1, not load-bearing for any forced conclusion).
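For readability, the inline notation quoted in the rationale can be typeset as follows. This is only a restatement of the quoted expressions: the arguments of Expand and the tail of the trajectory are left elided as in the quotation, and T(τ) is read as the tool-call count of a trajectory, following the referee summary's description of the filter.

```latex
\begin{align*}
G_{\mathrm{sub}}^{(K)} &= \mathrm{Expand}(\dots) \\
q &\sim P_{\mathrm{gen}}\bigl(q \mid G_{\mathrm{sub}}^{(K)}\bigr) \\
\tau &= (r_1, a_1, o_1, \dots) \\
D_{v2} &= \bigl\{(q, \tau) : T(\tau) \ge T_{\min}\bigr\}
\end{align*}
```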

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

Axiom (domain assumption): Benchmark scores reported by third-party technical reports are comparable across different tool sandboxes and harness configurations. Rationale: Baselines are taken from technical reports/leaderboards; OpenSeeker-v2 and Tongyi DeepResearch may not have run under identical search-tool backends, time limits, or scoring rubrics.
Axiom (domain assumption): Synthetic trajectories generated from a knowledge graph do not overlap meaningfully with held-out benchmark questions. Rationale: No contamination audit is described beyond masking huggingface links during evaluation.
Axiom (domain assumption): Qwen3-30B-A3B-Thinking-2507's pre-training is treated as fixed input. Rationale: The SFT-only claim depends on the base model already having strong reasoning/tool-use priors; the comparison to CPT+SFT+RL pipelines understates that the base model itself has been heavily trained.

pith-pipeline@v0.9.0 · 20382 in / 6925 out tokens · 108742 ms · 2026-05-06T05:14:08.993552+00:00 · methodology
