s3: You Don't Need That Much Data to Train a Search Agent via RL
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Harness-1 uses a state-externalizing harness for RL-trained search agents and reports 0.730 average curated recall, outperforming the next open subagent by 11.4 points.
ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents
ARBOR introduces a reusable rubric buffer that consolidates contrastive trajectory drafts into cross-query rubrics for online process rewards, outperforming GRPO and DAPO on multi-hop QA benchmarks.