arxiv: 2511.19399 · v3 · pith:3JTHBD5Mnew · submitted 2025-11-24 · 💻 cs.CL · cs.AI · cs.LG

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

classification 💻 cs.CL cs.AI cs.LG
keywords research, deep, agents, long-form, rubrics, learning, model, open

Deep research agents perform multi-step research to produce long-form, well-attributed answers. However, most open deep research agents are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards, which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), where rubrics are constructed and maintained to co-evolve with the policy model during training. This allows the rubrics to incorporate newly explored information from search and contrasting model responses, enabling better fact checking and more discriminative on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first fully open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare, and general domains, DR Tulu substantially outperforms existing open deep research agents (by 15.6% over Tongyi DR on average) and matches or exceeds proprietary deep research agents (by 0.7% over OpenAI DR on average), while being significantly smaller and cheaper per query (1000x cheaper than OpenAI DR per query).
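
The abstract describes rubrics that co-evolve with the policy so that feedback stays discriminative. The following is only a toy sketch of that idea, not the authors' implementation: the `Rubric` and `RubricPool` names, the keyword-based checks standing in for an LLM judge, and the rule of dropping rubrics that every rollout passes or fails identically are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rubric:
    """Hypothetical rubric: a named pass/fail check over a response."""
    name: str
    check: Callable[[str], bool]  # stand-in for an LLM judge (assumption)


@dataclass
class RubricPool:
    """Hypothetical rubric buffer that is updated between training steps."""
    rubrics: List[Rubric] = field(default_factory=list)

    def score(self, response: str) -> float:
        """Reward = fraction of rubrics satisfied (0.0 for an empty pool)."""
        if not self.rubrics:
            return 0.0
        passed = sum(1 for r in self.rubrics if r.check(response))
        return passed / len(self.rubrics)

    def evolve(self, rollouts: List[str], candidates: List[Rubric]) -> None:
        """Merge in candidate rubrics, then keep only those whose verdicts
        differ across the current rollouts (assumed filtering rule)."""
        merged = self.rubrics + candidates
        self.rubrics = [
            r for r in merged
            if len({r.check(x) for x in rollouts}) > 1
        ]


# Two on-policy rollouts for the same prompt (made-up strings).
rollouts = ["Aspirin reduces fever [1].", "Aspirin reduces fever."]

pool = RubricPool([Rubric("mentions_aspirin", lambda s: "aspirin" in s.lower())])
# A candidate rubric, imagined as derived from contrasting the two rollouts.
candidates = [Rubric("has_citation", lambda s: "[" in s)]

pool.evolve(rollouts, candidates)
kept = [r.name for r in pool.rubrics]
scores = [pool.score(x) for x in rollouts]
print(kept, scores)
```

In this toy run, the rubric that both rollouts satisfy carries no training signal and is dropped, while the citation rubric separates the two responses and is kept. A real system would need a judge model and a policy update step, both of which are omitted here.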

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Deep Reasoning in General Purpose Agents via Structured Meta-Cognition

    cs.CL 2026-05 unverdicted novelty 7.0

    DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.

  2. LLM Agents Already Know When to Call Tools -- Even Without Reasoning

    cs.CL 2026-05 conditional novelty 7.0

    LLMs encode tool necessity in pre-generation hidden states at AUROC 0.89-0.96, enabling Probe&Prefill to reduce tool calls 48% with 1.7% accuracy loss, outperforming prompt and reasoning baselines.

  3. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  4. Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

    cs.LG 2026-03 unverdicted novelty 7.0

    ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

  5. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

    cs.AI 2026-05 unverdicted novelty 6.0

    POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.

  6. Reward Hacking in Rubric-Based Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do no...

  7. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  8. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  9. Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

    cs.CL 2026-04 unverdicted novelty 6.0

    POP bootstraps post-training signals for open-ended LLM tasks by synthesizing rubrics during self-play on pretraining corpus, yielding performance gains on Qwen-2.5-7B across healthcare QA, creative writing, and instr...

  10. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  11. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  12. Olmo Hybrid: From Theory to Practice and Back

    cs.LG 2026-04 conditional novelty 6.0

    A 7B hybrid attention-recurrent model outperforms its pure-transformer counterpart on pretraining metrics and scales more efficiently, supported by a proof that hybrids are strictly more expressive than either transfo...

  13. Self-Optimizing Multi-Agent Systems for Deep Research

    cs.IR 2026-04 unverdicted novelty 6.0

    Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.

  14. Differentiable Evolutionary Reinforcement Learning

    cs.AI 2025-12 unverdicted novelty 6.0

    DERL is a differentiable bi-level method that evolves optimal reward structures for RL policies by composing atomic primitives and using meta-gradients from validation performance.

  15. GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

    cs.CL 2026-05 unverdicted novelty 5.0

    GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass execution with modular flexibility.

  16. GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

    cs.CL 2026-05 unverdicted novelty 5.0

    GRC unifies generation, retrieval, and compression in LLMs via meta latent tokens for single-pass inference with modular flexibility.

  17. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...