pith. machine review for the scientific record.

arxiv: 2604.19859 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI · cs.CL · cs.IR


DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data


Pith reviewed 2026-05-10 02:28 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.IR
keywords deep research agents · small language models · agentic reinforcement learning · edge deployment · supervised fine-tuning · information gain rewards · long-horizon tasks · open data training

The pith

A 4B model trained on roughly 10K open examples, using cleaned long-horizon trajectories and information-gain rewards, outperforms prior agents under 9B parameters on deep research tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that strong edge-scale deep research agents can be built from small language models using only limited open data. It combines strict data cleaning and resampling of long-horizon trajectories in a supervised fine-tuning stage with a reinforcement learning stage that uses turn-level rewards based on information gain and format regularization. A sympathetic reader would care because this recipe makes capable research agents feasible on devices where cost, latency, and privacy constraints rule out large models. The results indicate that 4B agents already hold substantial untapped performance potential once data quality and supervision density are addressed. Further analysis highlights the value of test-time scaling in this regime.

Core claim

DR-Venus-4B is a frontier 4B deep research agent trained entirely on open data through two stages: first, agentic supervised fine-tuning that improves data quality via strict cleaning and resampling of long-horizon trajectories, then agentic reinforcement learning that boosts execution reliability through turn-level rewards derived from information gain plus format-aware regularization; this produces performance that exceeds prior agentic models under 9B parameters on multiple deep research benchmarks and narrows the gap to 30B-class systems.

What carries the argument

The two-stage training recipe of agentic supervised fine-tuning on resampled long-horizon trajectories followed by agentic reinforcement learning with information-gain turn-level rewards and format regularization, which raises supervision density and credit assignment for small models.
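
The paper's exact reward equations are not quoted here, but the mechanism is concrete enough to sketch. The fragment below is a minimal Python illustration, assuming (as one common reading of IGPO-style turn-level rewards) that information gain at a turn is the increase in the policy's probability of the gold answer given the trajectory so far; answer_prob, alpha, and beta are hypothetical names introduced for the sketch, not the paper's API.

    from typing import Callable, List

    def turn_level_rewards(
        prefixes: List[str],        # trajectory prefix after each turn, including the initial prompt
        gold_answer: str,
        answer_prob: Callable[[str, str], float],  # p(gold | prefix) under the policy (hypothetical helper)
        format_ok: List[bool],      # per-turn check that tool calls and tags are well-formed
        alpha: float = 1.0,         # weight on information gain (a tuned free parameter, per the ledger below)
        beta: float = 0.1,          # weight on format-aware regularization
    ) -> List[float]:
        """Dense per-turn rewards: information gain plus a format term."""
        rewards = []
        for t in range(1, len(prefixes)):
            # How much did this turn raise the probability of the gold answer?
            gain = answer_prob(gold_answer, prefixes[t]) - answer_prob(gold_answer, prefixes[t - 1])
            penalty = 0.0 if format_ok[t - 1] else -1.0  # penalize malformed turns
            rewards.append(alpha * gain + beta * penalty)
        return rewards

Compared with a single outcome reward at the end of a long trajectory, this assigns credit at every turn, which is exactly the supervision-density point made above.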

If this is right

  • Small models become viable for long-horizon agent work once data cleaning, trajectory resampling, and dense turn-level rewards are applied.
  • The performance gap between 4B and 30B agents shrinks substantially on research benchmarks under this training approach.
  • Test-time scaling yields additional gains for these edge-scale agents beyond what training alone provides (a minimal sketch follows this list).
  • The same data-utilization methods can be reused to improve other small-model agentic systems that face data scarcity.
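
The test-time scaling referenced above is not specified in the material here; its simplest form is repeated sampling with a verifier. A hedged sketch, where agent_run and judge_score are hypothetical stand-ins for the agent rollout and an answer scorer:

    import random

    def best_of_n(agent_run, judge_score, task, n=8, seed=0):
        """Run the agent n times with different seeds and keep the answer
        the verifier scores highest -- one simple form of test-time scaling."""
        rng = random.Random(seed)
        candidates = [agent_run(task, seed=rng.randrange(2**31)) for _ in range(n)]
        return max(candidates, key=judge_score)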

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the benchmarks prove representative, similar recipes could enable on-device agents that handle research without cloud transmission of user queries.
  • Data quality and reward design may prove more decisive than raw parameter count for certain agent capabilities once a minimum scale is reached.
  • The approach invites direct tests on adjacent tasks such as multi-step web navigation or code execution with comparably small models and open data only.

Load-bearing premise

The selected deep research benchmarks and evaluation protocols accurately measure real-world long-horizon agent performance and contain no overlap or leakage with the 10K training examples.

What would settle it

Running DR-Venus-4B and the prior under-9B baselines on a fresh set of deep research tasks whose content and trajectories have zero overlap with the training data or existing benchmarks, then checking whether the 4B model still leads.
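
Short of a fresh benchmark, a first-pass leakage audit is mechanical. The sketch below, in line with the overlap statistics the referee report requests, computes word-level 13-gram Jaccard overlap between each benchmark item and the training trajectories; the threshold and function names are assumptions for illustration, not the paper's protocol.

    def ngrams(text: str, n: int = 13) -> set:
        """Word-level n-grams, lowercased."""
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def jaccard(a: set, b: set) -> float:
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def worst_case_overlap(train_texts, bench_texts, n=13):
        """For each benchmark item, its maximum Jaccard against any training trajectory."""
        train_sets = [ngrams(t, n) for t in train_texts]
        return [max((jaccard(ngrams(q, n), s) for s in train_sets), default=0.0)
                for q in bench_texts]

    # Benchmark items scoring above a chosen threshold (say 0.1) would warrant manual audit.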

read the original abstract

Edge-scale deep research agents based on small language models are attractive for real-world deployment due to their advantages in cost, latency, and privacy. In this work, we study how to train a strong small deep research agent under limited open data by improving both data quality and data utilization. We present DR-Venus, a frontier 4B deep research agent for edge-scale deployment, built entirely on open data. Our training recipe consists of two stages. In the first stage, we use agentic supervised fine-tuning (SFT) to establish basic agentic capability, combining strict data cleaning with resampling of long-horizon trajectories to improve data quality and utilization. In the second stage, we apply agentic reinforcement learning (RL) to further improve execution reliability on long-horizon deep research tasks. To make RL effective for small agents in this setting, we build on IGPO and design turn-level rewards based on information gain and format-aware regularization, thereby enhancing supervision density and turn-level credit assignment. Built entirely on roughly 10K open-data examples, DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks, while also narrowing the gap to much larger 30B-class systems. Our further analysis shows that 4B agents already possess surprisingly strong performance potential, highlighting both the deployment promise of small models and the value of test-time scaling in this setting. We release our models, code, and key recipes to support reproducible research on edge-scale deep research agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DR-Venus, a 4B-parameter deep research agent trained exclusively on approximately 10K open data trajectories. It employs a two-stage training process: agentic supervised fine-tuning (SFT) with strict data cleaning and resampling of long-horizon trajectories, followed by agentic reinforcement learning (RL) using IGPO with turn-level rewards based on information gain and format-aware regularization. The central claim is that DR-Venus-4B significantly outperforms prior agentic models under 9B parameters on multiple deep research benchmarks and narrows the performance gap to larger 30B-class systems, while also releasing the models, code, and recipes for reproducibility.

Significance. If the empirical results hold after addressing data integrity concerns, the work would be significant for demonstrating that small models can achieve strong long-horizon agentic performance via targeted data curation and dense turn-level supervision in RL. This supports practical edge-scale deployment and provides reproducible artifacts (models, code, recipes) that strengthen the contribution.

major comments (2)
  1. §3 (Training Recipe) and §4 (Experiments): The central performance claims rely on the ~10K open trajectories being free of semantic or lexical overlap with the deep research benchmarks. The manuscript mentions 'strict data cleaning' and 'resampling' but reports no quantitative overlap statistics (e.g., 13-gram Jaccard indices, embedding cosine thresholds, or manual audit counts). Without these, gains could arise from partial memorization rather than the IGPO rewards or data utilization improvements, undermining attribution to the proposed methods.
  2. §4 (Experiments): The reported outperformance lacks accompanying details on exact baseline implementations, statistical significance tests, error bars across runs, and explicit train/test split descriptions. These omissions make it impossible to verify the reliability and magnitude of the claimed improvements over <9B models and the gap closure to 30B systems.
minor comments (2)
  1. The definition and implementation details of the information-gain component in the turn-level reward (likely in §3.2) would benefit from an explicit equation or pseudocode to clarify how it differs from standard outcome-based rewards.
  2. Table or figure captions in the results section should explicitly list all compared models with parameter counts and training data sources for easier cross-reference.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment point by point below. Where the feedback identifies areas for improved transparency, we will revise the manuscript accordingly while preserving the integrity of our original claims and experiments.

read point-by-point responses
  1. Referee: §3 (Training Recipe) and §4 (Experiments): The central performance claims rely on the ~10K open trajectories being free of semantic or lexical overlap with the deep research benchmarks. The manuscript mentions 'strict data cleaning' and 'resampling' but reports no quantitative overlap statistics (e.g., 13-gram Jaccard indices, embedding cosine thresholds, or manual audit counts). Without these, gains could arise from partial memorization rather than the IGPO rewards or data utilization improvements, undermining attribution to the proposed methods.

    Authors: We agree that quantitative overlap statistics would provide stronger evidence against memorization and better support attribution to our data curation and IGPO methods. The manuscript describes the strict cleaning and resampling steps but does not report the requested metrics. In the revised version we will add a new paragraph in §3 that includes: (i) 13-gram Jaccard index distributions across the 10K trajectories, (ii) embedding cosine similarity thresholds and rejection rates applied during filtering, and (iii) results of a manual audit on a random subset of 500 trajectories. These additions will be presented without altering the training pipeline or experimental results. revision: yes

  2. Referee: §4 (Experiments): The reported outperformance lacks accompanying details on exact baseline implementations, statistical significance tests, error bars across runs, and explicit train/test split descriptions. These omissions make it impossible to verify the reliability and magnitude of the claimed improvements over <9B models and the gap closure to 30B systems.

    Authors: We acknowledge that the current experimental section would benefit from greater detail to enable independent verification. While §4 provides an overview of baselines, splits, and results, we will expand it in the revision to include: (1) exact prompts, decoding parameters, and implementation notes for every baseline, (2) statistical significance tests (paired t-tests with p-values) comparing DR-Venus-4B against each <9B model, (3) error bars derived from three independent training runs, and (4) explicit descriptions of the train/test splits with dataset sizes and any filtering applied. These changes will be added to §4 without modifying the underlying performance numbers. revision: yes
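
The statistics the rebuttal promises are standard; a minimal sketch of the paired test and run-level error bars, with illustrative made-up numbers standing in for the per-item scores:

    import numpy as np
    from scipy import stats

    # Hypothetical per-item scores for DR-Venus-4B and one <9B baseline on the same benchmark items.
    venus = np.array([0.62, 0.58, 0.71, 0.55, 0.66])
    baseline = np.array([0.51, 0.49, 0.63, 0.52, 0.57])

    # Paired t-test, as proposed: same items, paired scores.
    t_stat, p_value = stats.ttest_rel(venus, baseline)
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

    # Error bars across independent training runs: mean +/- sample std over run-level means.
    run_means = np.array([0.624, 0.611, 0.637])  # e.g., three seeds
    print(f"mean = {run_means.mean():.3f} +/- {run_means.std(ddof=1):.3f}")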

Circularity Check

0 steps flagged

No circularity: empirical training recipe evaluated on external benchmarks

full rationale

The paper presents a two-stage empirical pipeline (agentic SFT on cleaned 10K trajectories followed by IGPO-based RL with turn-level rewards) and reports benchmark performance. No equations, derivations, or 'predictions' are defined in terms of fitted parameters or self-citations by construction. Data cleaning and resampling are described as preprocessing steps external to the model outputs, and benchmark results are direct comparisons rather than quantities forced by the training process itself. The central claims remain falsifiable against held-out test sets.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard machine-learning assumptions about generalization from curated training trajectories to held-out benchmarks and on the effectiveness of the proposed information-gain reward design; no new mathematical entities are introduced.

free parameters (1)
  • turn-level reward scaling factors
    Weights or scaling constants for information-gain and format-regularization terms in the RL stage are chosen or tuned to produce the reported gains.
axioms (1)
  • domain assumption
    The selected open-data trajectories and benchmarks are representative of general deep research tasks, without significant distribution shift or leakage.
    All performance claims depend on this holding.

pith-pipeline@v0.9.0 · 5625 in / 1306 out tokens · 64071 ms · 2026-05-10T02:28:24.029108+00:00 · methodology



    and Tongyi DeepResearch (Team et al., 2025b), and are further refined to better support the reasoning and interaction patterns. System Prompt of DR-V enus Y ou are a deep research assistant. Y our core function is to conduct thorough, multi-source investigations into any topic. Y ou must handle both broad, open-domain inquiries and queries within speciali...