Finding evidence for reasoning step 1

Refer to the previous dialogue records in the history, including the user's queries, previous `<tool_call>`, `<response>`, and any tool feedback noted as `<obs>` (if it exists) · 2040

1 Pith paper cites this work. Polarity classification is still indexing.


representative citing papers

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

cs.LG · 2026-05-21 · conditional · novelty 6.0

VPO modifies the GRPO advantage estimator to train LLMs for diversity across vector reward trade-offs, matching or exceeding scalar RL baselines on test-time search with larger gains at higher search budgets.

citing papers explorer


Vector Policy Optimization: Training for Diversity Improves Test-Time Search · cs.LG · 2026-05-21 · conditional · none · ref 10
