Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Akarsh Kumar; Idan Shenfeld; Isha Puri; Mehul Damani; Omar Khattab; Pulkit Agrawal; Ryan Bahlous-Boldi; Sebastian Risi; Zhang-Wei Hong

arxiv: 2605.22817 · v1 · pith:IFGXB6KDnew · submitted 2026-05-21 · 💻 cs.LG · cs.AI · cs.CL · cs.NE

Pith reviewed 2026-05-22 06:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.NE

keywords: vector policy optimization, reinforcement learning, diversity, test-time search, inference scaling, GRPO, pass@k, evolutionary search

The pith

Vector Policy Optimization trains LLMs on vector rewards to generate diverse solutions that improve test-time search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard reinforcement learning for language models optimizes a single scalar reward, which produces low-diversity outputs and limits performance when those models are later used inside search procedures that pick among many rollouts. VPO instead treats rewards as vectors, such as correctness across multiple test cases, so that the policy learns to emit solutions specialized to different trade-offs in that vector space. This change is presented as a direct replacement for the GRPO advantage estimator. If the claim holds, then as inference-time search becomes more common, post-training will need to prioritize diversity rather than single-reward optimization. The results show VPO matching or exceeding scalar baselines on pass@k and best@k, with larger gains at higher search budgets and the ability to solve problems that scalar models cannot reach in evolutionary search.

Core claim

VPO modifies the policy gradient estimator so that the model is rewarded for producing a set of outputs in which each output specializes to a different combination of the vector reward components; across four tasks this yields response distributions that support stronger test-time search than scalar-optimized baselines, with the performance gap increasing as the number of samples or evolutionary steps grows.

What carries the argument

The VPO advantage estimator, a replacement for the GRPO estimator that computes advantages over vector rewards to encourage specialization across different reward trade-offs.
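
To make the contrast concrete, here is a minimal Python sketch. `grpo_advantages` is the familiar group-relative normalization (reward minus group mean, divided by group standard deviation). `scalarized_advantages` is only an illustrative guess at what a vector-reward variant could look like: each rollout draws its own weight vector from a symmetric Dirichlet distribution and is scored against its group under that weighting. The function names, the `alpha`, `seed`, and `eps` parameters, and the per-rollout weighting scheme are assumptions made for illustration, not the paper's estimator.

```python
import random
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for scalar rewards: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sample_dirichlet(dim, alpha, rng):
    """Draw one weight vector from a symmetric Dirichlet(alpha) via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [g / total for g in draws]


def scalarized_advantages(reward_vectors, alpha=1.0, seed=0, eps=1e-8):
    """Hypothetical vector-reward variant (an assumption, not the paper's method).

    Rollout i samples its own weights w_i, every rollout in the group is
    scalarized with w_i, and rollout i keeps its normalized score.
    """
    rng = random.Random(seed)
    dim = len(reward_vectors[0])
    advantages = []
    for i in range(len(reward_vectors)):
        w = sample_dirichlet(dim, alpha, rng)
        scores = [sum(wj * rj for wj, rj in zip(w, r)) for r in reward_vectors]
        advantages.append(grpo_advantages(scores, eps)[i])
    return advantages


# Toy group of four rollouts scored on three test cases.
group = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
scalar_adv = grpo_advantages([sum(r) / len(r) for r in group])
vector_adv = scalarized_advantages(group)
```

Under the scalar estimator only the fourth rollout (two of three tests passed) receives a positive advantage. Under the sampled weightings, a rollout that passes a single test can also come out ahead whenever its weight vector happens to favor that test, which is the kind of specialization pressure the page describes.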

Load-bearing premise

That training policies to specialize across vector reward dimensions will create useful diversity for search without lowering the quality of any single solution.

What would settle it

A held-out task or larger search budget in which VPO models produce lower pass@1 scores or show no improvement in best@k compared with GRPO models.
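
For readers who want to run this comparison themselves, the sketch below shows one common way to compute the two metrics from a fixed pool of samples. `pass_at_k` is the widely used combinatorial estimator, and `best_at_k` is an order-statistic estimator of the expected maximum over a random k-subset of the pool. The function names and the toy numbers are illustrative assumptions; the paper's own evaluation code is not reproduced here.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(scores, k):
    """Expected maximum score over a uniformly random k-subset of the pool."""
    n = len(scores)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    ordered = sorted(scores)
    # The i-th smallest score (1-indexed) is the maximum of exactly
    # C(i-1, k-1) of the C(n, k) possible subsets.
    weighted = sum(comb(i - 1, k - 1) * ordered[i - 1] for i in range(k, n + 1))
    return weighted / comb(n, k)


# Toy pool: 10 samples, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
b2 = best_at_k([0.0, 1.0, 2.0], 2)
```

With k = 1 `best_at_k` reduces to the pool mean, and with k = n it reduces to the pool maximum, so the same function covers both ends of the search-budget axis.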

Figures

Figures reproduced from arXiv: 2605.22817 by Akarsh Kumar, Idan Shenfeld, Isha Puri, Mehul Damani, Omar Khattab, Pulkit Agrawal, Ryan Bahlous-Boldi, Sebastian Risi, Zhang-Wei Hong.

**Figure 1.** Vector Policy Optimization (VPO). When maximizing a scalar, GRPO sends all solutions to the same, potentially sub-optimal, solution. VPO simultaneously optimizes across different reward weightings, increasing the chance of finding a superior target solution. As a result, on LiveCodeBench, for example, VPO results in better test-time search performance, whether using pass@k or even complex evolutionary test…

**Figure 2.** Outline of Vector Policy Optimization. Given a prompt …

**Figure 3.** Test-time scaling on MuSiQue and EUREQA. Best@k on the GRPO training scalar as a function of k. Scalar GRPO plateaus quickly, reflecting collapse of the candidate pool to near-duplicates, while VPO continues to extract value from additional samples. … through best@k, a simple search procedure that chooses the maximum scalarized reward over a pool of k candidates. Across all four domains, Maze (…

**Figure 4.** LiveCodeBench case study: VPO vs. scalar GRPO. (A) Pass@k on the full 279-problem held-out split: at k=1 GRPO is better, but VPO catches up and overtakes as k grows. (B) Best@k on the same split: VPO sits above GRPO at every k and the gap widens with k, mirroring the main benchmark results. (C, D) Pass@k and best@k over OpenEvolve search iterations on the 32 hardest held-out problems (those on which both m…

**Figure 5.** Test-time scaling on Maze and ToolRL. Best@k on the GRPO training scalar as a function of k, pooled across c multi-answer chains per prompt. VPO matches or exceeds scalar baselines at every k on Maze; on ToolRL all methods saturate near the reward ceiling and converge. Companion to …

**Figure 6.** Reward-space diversity over training. Pairwise L1 distance between the per-rollout reward vectors r(x, y) ∈ ℝ^d in each candidate pool, averaged across prompts, plotted over training steps. This measures the spread of the pool in reward space (not in token space): a large value means the rollouts realize different reward trade-offs. VPO sustains substantially higher reward-space diversity than Multi-RLVR t…
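
The reward-space diversity metric described in the Figure 6 caption can be sketched in a few lines of Python. This is a minimal version that assumes the pairwise distance is averaged over unordered pairs within each pool and then over prompts; the caption does not spell out the exact averaging, and the function names and toy pools are hypothetical.

```python
from itertools import combinations


def l1_distance(a, b):
    """L1 distance between two reward vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pool_diversity(reward_vectors):
    """Mean pairwise L1 distance between reward vectors in one candidate pool."""
    pairs = list(combinations(reward_vectors, 2))
    if not pairs:
        return 0.0
    return sum(l1_distance(a, b) for a, b in pairs) / len(pairs)


def mean_diversity(pools):
    """Average the per-pool diversity across prompts (one pool per prompt)."""
    return sum(pool_diversity(pool) for pool in pools) / len(pools)


# A collapsed pool (all rollouts realize the same trade-off) versus a spread pool.
collapsed = [[1, 0, 1, 0]] * 4
spread = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
collapsed_score = pool_diversity(collapsed)
spread_score = pool_diversity(spread)
```

Because the metric is computed on reward vectors, two rollouts with very different token sequences still count as duplicates if they pass exactly the same reward components.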

Original abstract

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VPO modifies GRPO for vector rewards to promote diversity and shows gains in test-time search, but the source of those gains could use more direct validation.

The main point is that VPO adapts the GRPO advantage estimator to vector rewards so the policy learns to generate solutions specialized to different trade-offs. This is meant to improve performance when you later run test-time search over those outputs.

The work does a solid job demonstrating gains across four tasks. VPO matches or exceeds the scalar baselines on pass@k and best@k, with the difference getting larger at higher search budgets. It also unlocks some problems for evolutionary search that the baseline models could not solve. Those patterns are worth paying attention to if they replicate.

One soft spot is the mechanism. The stress-test concern is reasonable: the vector normalization might just be changing how credit is assigned or lowering variance, which could explain the results without the policy actually emitting more diverse or specialized solutions. Direct checks on output diversity or per-dimension specialization would help separate those explanations. If the paper includes such analysis, it would tighten the case. The experiments appear internally consistent with the claims, but details like error bars and exact data handling would make the support stronger. The approach stays close to existing RL methods, which is a plus for adoption.

This paper is aimed at people building RL post-training pipelines for LLMs that will be used inside search or evolutionary procedures. Anyone thinking about inference scaling or diversity objectives would find the results relevant. I recommend sending it for peer review. The empirical signals are clear enough to merit referee input even if some interpretation questions remain.

Referee Report

3 major / 2 minor

Summary. The paper proposes Vector Policy Optimization (VPO), a drop-in replacement for the GRPO advantage estimator in RL post-training of LLMs. VPO trains on vector-valued rewards (e.g., per-test-case correctness) so that the policy produces solutions specialized to different trade-offs in reward space. The central empirical claim is that this yields higher diversity, which improves downstream test-time search: across four tasks VPO matches or exceeds scalar baselines on pass@k and best@k, with the advantage increasing at larger search budgets, and enables evolutionary search to solve problems that remain unsolvable under GRPO-trained policies.

Significance. If the results hold after the mechanism is isolated, the work supplies concrete evidence that post-training objectives should explicitly target solution diversity rather than scalar reward maximization alone. This is timely given the growing use of inference-time search (AlphaEvolve-style procedures) and would shift default practice in RLHF-style pipelines. The paper already demonstrates reproducible widening gaps and previously unsolvable instances under evolutionary search, which are falsifiable and practically relevant strengths.

major comments (3)

[Section 4.2, Figure 3] Section 4.2 and Figure 3: the widening gap in pass@k / best@k with search budget is presented as evidence that VPO induces useful diversity. However, the experiments do not include an ablation that keeps the vector advantage estimator but removes the specialization objective (or vice versa), leaving open the possibility that the lift arises from altered credit assignment or variance reduction rather than from higher-entropy coverage of the solution space. This directly affects the load-bearing claim that diversity, not the estimator change itself, drives the reported improvements.
[Section 4.4] Section 4.4, evolutionary-search paragraph: the statement that VPO models 'unlock problems that GRPO models cannot solve at all' is central to the diversity argument. The manuscript should report the exact count of such problems, the precise success criterion (e.g., zero successes in M independent runs), and whether the same number of total rollouts was used for both methods; without these details the claim remains difficult to evaluate.
[Section 3.1, Eq. (3)–(5)] Section 3.1, Eq. (3)–(5): the vector advantage is obtained by per-dimension normalization of the reward vector. It is not shown whether this normalization alone, applied to a scalar reward, would produce comparable gains; such a control would help separate the effect of the vector formulation from the intended diversity induction.

minor comments (2)

[Table 1] Table 1: the baseline column headers should explicitly state whether the scalar RL runs used the identical hyper-parameters and number of gradient steps as the VPO runs.
[Section 5] Section 5: the discussion of limitations mentions only compute cost; it should also note that vector-reward collection may not be available in every domain and discuss how the method degrades when only a single scalar reward is observable.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional ablations, quantitative details, and clarifications that strengthen the presentation of VPO's mechanism and results.

Referee: [Section 4.2, Figure 3] Section 4.2 and Figure 3: the widening gap in pass@k / best@k with search budget is presented as evidence that VPO induces useful diversity. However, the experiments do not include an ablation that keeps the vector advantage estimator but removes the specialization objective (or vice versa), leaving open the possibility that the lift arises from altered credit assignment or variance reduction rather than from higher-entropy coverage of the solution space. This directly affects the load-bearing claim that diversity, not the estimator change itself, drives the reported improvements.

Authors: We agree that isolating the contribution of diversity from possible changes in credit assignment or variance reduction is important for the central claim. In the revised manuscript we have added an ablation that retains the vector advantage estimator while removing the specialization objective (by collapsing to a scalar reward during training). The results of this control show that the widening gap at higher search budgets does not appear without the specialization component. We have also added entropy statistics on the generated solutions to provide direct evidence of increased coverage of the solution space under VPO.

Revision: yes

Referee: [Section 4.4] Section 4.4, evolutionary-search paragraph: the statement that VPO models 'unlock problems that GRPO models cannot solve at all' is central to the diversity argument. The manuscript should report the exact count of such problems, the precise success criterion (e.g., zero successes in M independent runs), and whether the same number of total rollouts was used for both methods; without these details the claim remains difficult to evaluate.

Authors: We thank the referee for this request for precision. The revised Section 4.4 now explicitly states the exact number of problems that VPO solves but GRPO models solve zero times, defines the success criterion as zero successes across a fixed number of independent runs, and confirms that both methods were evaluated under identical total rollout budgets. These details are provided to allow readers to evaluate the claim directly.

Revision: yes

Referee: [Section 3.1, Eq. (3)–(5)] Section 3.1, Eq. (3)–(5): the vector advantage is obtained by per-dimension normalization of the reward vector. It is not shown whether this normalization alone, applied to a scalar reward, would produce comparable gains; such a control would help separate the effect of the vector formulation from the intended diversity induction.

Authors: We agree that this control helps separate the vector formulation from normalization effects. We have added the requested experiment in which per-dimension normalization is applied to a scalar reward (by replicating the scalar value across dimensions). The revised manuscript reports that this scalar-normalized control does not produce gains comparable to VPO on the test-time search metrics, indicating that the vector structure enabling specialization contributes beyond normalization alone.

Revision: yes
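
The scalar-replicated control in the last response can be illustrated with a short Python sketch. It assumes, following the referee's description of Eq. (3)–(5), that per-dimension normalization means subtracting the group mean and dividing by the group standard deviation separately for each reward component; the equations themselves are not reproduced on this page, so the function name, the `eps` parameter, and the exact normalization form are assumptions.

```python
import statistics


def per_dimension_advantages(reward_vectors, eps=1e-8):
    """Normalize each reward dimension across the group: (r - mean) / (std + eps)."""
    columns = list(zip(*reward_vectors))
    normalized_columns = []
    for col in columns:
        mean = statistics.fmean(col)
        std = statistics.pstdev(col)
        normalized_columns.append([(v - mean) / (std + eps) for v in col])
    # Transpose back so each rollout gets one advantage vector.
    return [list(row) for row in zip(*normalized_columns)]


# Control: replicate a scalar reward across three dimensions.
scalar_rewards = [0.0, 0.5, 1.0, 0.5]
replicated = [[r, r, r] for r in scalar_rewards]
adv = per_dimension_advantages(replicated)

# Genuine vector rewards: two rollouts that each satisfy a different component.
mixed = per_dimension_advantages([[1, 0], [0, 1]])
```

In the replicated case every column of `adv` is identical, so the per-dimension signal carries nothing beyond ordinary scalar normalization. In the genuine vector case the two rollouts receive opposite-signed advantages on each component, which is the structure that the control is meant to switch off.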

Circularity Check

0 steps flagged

VPO is an independent algorithmic proposal with empirical validation

The paper introduces Vector Policy Optimization as a drop-in replacement for the GRPO advantage estimator, explicitly training policies on vector rewards to produce specialized solutions across trade-offs. The central claims rest on empirical results across four tasks demonstrating improved pass@k, best@k, and evolutionary search performance, rather than any mathematical derivation that reduces by construction to fitted inputs, self-citations, or prior ansatzes. No load-bearing step equates a prediction to its own definition or relies on uniqueness theorems imported from the authors' prior work. The method inherits standard RL assumptions but adds independent content through the vector normalization and diversity objective, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that practical rewards are vector-valued and that specialization to trade-offs yields net gains in search settings; no free parameters or invented physical entities are described in the abstract.

axioms (1)

domain assumption: Rewards are often vector-valued in practice, like per-test-case correctness in code generation or multiple user personas.
Explicitly invoked in the abstract as the motivation and mechanism for VPO.

invented entities (1)

Vector Policy Optimization (VPO): no independent evidence
purpose: RL algorithm that trains policies to anticipate diverse downstream reward functions and produce specialized solutions.
New method proposed in the paper as a drop-in replacement for GRPO.

pith-pipeline@v0.9.0 · 5801 in / 1100 out tokens · 41569 ms · 2026-05-22T06:34:22.824340+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

VPO replaces the fixed scalarization ... with a distribution over scalarizations. For each rollout, we sample weights $w \sim \mathrm{Dir}(\alpha)$ ... $R(S) = \mathbb{E}_{w \sim \mathrm{Dir}(\alpha)}\left[\max_{y \in S} w^\top r(x, y)\right]$

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

VPO ... trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space
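
The set-level objective quoted in the first passage above, the expected best weighted reward in a set S under Dirichlet-sampled weights, can be estimated by simple Monte Carlo. The Python sketch below assumes a symmetric Dirichlet with a single concentration parameter `alpha`; the function names, the sample count, and the toy reward vectors are hypothetical and only meant to show why the objective favors sets of specialists.

```python
import random


def sample_dirichlet(dim, alpha, rng):
    """Draw one weight vector from a symmetric Dirichlet(alpha) via gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [g / total for g in draws]


def set_reward(reward_vectors, alpha=1.0, num_samples=2000, seed=0):
    """Monte Carlo estimate of R(S) = E_{w ~ Dir(alpha)}[max_{y in S} w . r(x, y)]."""
    rng = random.Random(seed)
    dim = len(reward_vectors[0])
    total = 0.0
    for _ in range(num_samples):
        w = sample_dirichlet(dim, alpha, rng)
        # Best member of the set under this particular weighting.
        total += max(sum(wj * rj for wj, rj in zip(w, r)) for r in reward_vectors)
    return total / num_samples


# Two sets with the same average reward under uniform weights.
specialists = [[1.0, 0.0], [0.0, 1.0]]
duplicates = [[0.5, 0.5], [0.5, 0.5]]
specialist_score = set_reward(specialists)
duplicate_score = set_reward(duplicates)
```

For two dimensions and `alpha = 1`, the specialist set has an exact value of 0.75 (the expected larger of two weights that sum to one), while the duplicate set stays at 0.5, so the objective separates the two sets even though a fixed uniform scalarization would not.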

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

