LeTS: Learning to Think-and-Search via Process-and-Outcome Reward Hybridization
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents
ARBOR introduces a reusable rubric buffer that consolidates contrastive trajectory drafts into cross-query rubrics for online process rewards, outperforming GRPO and DAPO on multi-hop QA benchmarks.