Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Pith reviewed 2026-05-22 06:34 UTC · model grok-4.3
The pith
Vector Policy Optimization trains LLMs on vector rewards to generate diverse solutions that improve test-time search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VPO modifies the policy gradient estimator so that the model is rewarded for producing a set of outputs in which each output specializes to a different combination of the vector reward components. Across four tasks, the resulting response distributions match or beat scalar-optimized baselines under test-time search, and the gap grows with the number of samples or evolutionary steps.
What carries the argument
The VPO advantage estimator, a replacement for the GRPO estimator that computes advantages over vector rewards to encourage specialization across different reward trade-offs.
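To make the contrast concrete, here is a minimal sketch of a GRPO-style group-normalized advantage next to one possible vector-reward variant. The GRPO part follows the usual definition (standardize scalar rewards within a prompt's rollout group). The vector part is an assumption pieced together from two hints on this page: the referee's description of per-dimension normalization, and the quoted passage about sampling weights w ~ Dir(α) for each rollout. The function names and the way the two pieces are combined are hypothetical and should not be read as the paper's Eq. (3)–(5).

```python
import random
from statistics import fmean, pstdev

EPS = 1e-8  # guards against division by zero when a group has constant rewards


def standardize(values):
    """Subtract the group mean and divide by the group (population) std."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def grpo_advantages(scalar_rewards):
    """GRPO-style advantage: standardize scalar rewards within one rollout group."""
    return standardize(scalar_rewards)


def sample_dirichlet(dim, alpha, rng):
    """Draw w ~ Dir(alpha, ..., alpha) by normalizing independent Gamma(alpha, 1) draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(g)
    return [x / total for x in g]


def vector_advantages(vector_rewards, alpha=1.0, rng=None):
    """HYPOTHETICAL vector variant, not the paper's estimator.

    Standardizes each reward dimension across the group, then scalarizes each
    rollout with its own freshly sampled Dirichlet weight vector.
    """
    rng = rng or random.Random()
    dims = len(vector_rewards[0])
    # columns[d][i] is the standardized reward of rollout i on dimension d.
    columns = [standardize([r[d] for r in vector_rewards]) for d in range(dims)]
    advantages = []
    for i in range(len(vector_rewards)):
        w = sample_dirichlet(dims, alpha, rng)  # one trade-off per rollout
        advantages.append(sum(w[d] * columns[d][i] for d in range(dims)))
    return advantages
```

Because each rollout is scored under a different sampled trade-off, two rollouts with the same scalar sum can receive different advantages in this sketch, which is the kind of pressure toward specialization the page attributes to VPO.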
Load-bearing premise
That training policies to specialize across vector reward dimensions will create useful diversity for search without lowering the quality of any single solution.
What would settle it
A held-out task or larger search budget in which VPO models produce lower pass@1 scores or show no improvement in best@k compared with GRPO models.
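Such a comparison depends on how pass@k and best@k are estimated from a finite sample pool. The sketch below uses the standard unbiased estimators over n samples per prompt: the combinatorial pass@k formula, and the order-statistic form of the expected maximum over a uniformly drawn k-element subset. The function names are illustrative, and the sketch does not claim to reproduce the paper's evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that a uniform k-subset of n samples
    contains at least one of the c correct samples."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # comb(n - c, k) counts the k-subsets made up entirely of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_at_k(rewards, k):
    """Unbiased best@k: expected maximum reward over a uniform k-subset
    (drawn without replacement) of the given rewards."""
    n = len(rewards)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    ordered = sorted(rewards)
    # The i-th smallest reward is the subset maximum in comb(i - 1, k - 1) subsets.
    weighted = sum(comb(i - 1, k - 1) * ordered[i - 1] for i in range(k, n + 1))
    return weighted / comb(n, k)
```

With these in hand, the falsification test amounts to computing both metrics per prompt for VPO and GRPO pools of equal size, averaging over prompts, and checking whether the VPO curve falls below the GRPO curve at k = 1 or fails to separate from it as k grows.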
Original abstract
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
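The idea of rewarding a set of solutions for covering different trade-offs can be illustrated with the set-level objective quoted in the theorem-link section of this page, R(S) = E over w ~ Dir(α) of max over y in S of w·r(x, y). The sketch below is a plain Monte Carlo estimate of that quantity; the function name, the sample count, and the toy reward vectors are assumptions made for illustration, and nothing here is taken from the paper's implementation.

```python
import random


def set_reward(vector_rewards, alpha=1.0, num_weight_samples=20000, seed=0):
    """Monte Carlo estimate of E_{w ~ Dir(alpha)} [ max_{y in S} w . r(y) ].

    vector_rewards holds one reward vector per solution in the set S.
    """
    rng = random.Random(seed)
    dims = len(vector_rewards[0])
    total = 0.0
    for _ in range(num_weight_samples):
        # Dirichlet draw: normalize independent Gamma(alpha, 1) variates.
        g = [rng.gammavariate(alpha, 1.0) for _ in range(dims)]
        norm = sum(g)
        # Score every solution under this trade-off and keep the best one.
        total += max(sum(gi * ri for gi, ri in zip(g, r)) / norm
                     for r in vector_rewards)
    return total / num_weight_samples


# Toy sets (hypothetical): two solutions, two reward dimensions.
specialized = [[1.0, 0.0], [0.0, 1.0]]  # each solution wins a different dimension
redundant = [[1.0, 0.0], [1.0, 0.0]]    # both solutions win the same dimension

print(set_reward(specialized), set_reward(redundant))
```

For α = 1 and two dimensions, the exact values are 0.75 for the specialized set and 0.5 for the redundant one, so the estimate for the specialized set should come out clearly higher. Each solution in the two sets has the same total reward, which is why a fixed equal-weight scalarization would not distinguish them while this set-level objective does.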
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Vector Policy Optimization (VPO), a drop-in replacement for the GRPO advantage estimator in RL post-training of LLMs. VPO trains on vector-valued rewards (e.g., per-test-case correctness) so that the policy produces solutions specialized to different trade-offs in reward space. The central empirical claim is that this yields higher diversity, which improves downstream test-time search: across four tasks VPO matches or exceeds scalar baselines on pass@k and best@k, with the advantage increasing at larger search budgets, and enables evolutionary search to solve problems that remain unsolvable under GRPO-trained policies.
Significance. If the results hold after the mechanism is isolated, the work supplies concrete evidence that post-training objectives should explicitly target solution diversity rather than scalar reward maximization alone. This is timely given the growing use of inference-time search (AlphaEvolve-style procedures) and would shift default practice in RLHF-style pipelines. The paper already demonstrates reproducible widening gaps and previously unsolvable instances under evolutionary search, which are falsifiable and practically relevant strengths.
major comments (3)
- [Section 4.2, Figure 3] The widening gap in pass@k / best@k with search budget is presented as evidence that VPO induces useful diversity. However, the experiments include no ablation that keeps the vector advantage estimator but removes the specialization objective (or vice versa), leaving open the possibility that the lift arises from altered credit assignment or variance reduction rather than from higher-entropy coverage of the solution space. This directly affects the load-bearing claim that diversity, not the estimator change itself, drives the reported improvements.
- [Section 4.4, evolutionary-search paragraph] The statement that VPO models 'unlock problems that GRPO models cannot solve at all' is central to the diversity argument. The manuscript should report the exact count of such problems, the precise criterion for 'cannot solve' (e.g., zero successes in M independent runs), and whether both methods used the same number of total rollouts; without these details the claim is difficult to evaluate.
- [Section 3.1, Eq. (3)–(5)] The vector advantage is obtained by per-dimension normalization of the reward vector. It is not shown whether this normalization alone, applied to a scalar reward, would produce comparable gains; such a control would help separate the effect of the vector formulation from the intended diversity induction.
minor comments (2)
- [Table 1] The baseline column headers should explicitly state whether the scalar RL runs used the same hyperparameters and number of gradient steps as the VPO runs.
- [Section 5] The discussion of limitations mentions only compute cost; it should also note that vector-reward collection may not be available in every domain and discuss how the method degrades when only a single scalar reward is observable.
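The single-scalar case in the last comment can be worked through directly, taking the set-level objective quoted in the theorem-link section of this page at face value and assuming nothing further about the paper's estimator. With one reward dimension (D = 1), the Dirichlet distribution puts all its mass on w = 1, so the expectation over scalarizations disappears:

```latex
R(S) = \mathbb{E}_{w \sim \mathrm{Dir}(\alpha)}\Big[\max_{y \in S} w^{\top} r(x, y)\Big]
\quad\xrightarrow{\; D = 1,\ w \equiv 1 \;}\quad
R(S) = \max_{y \in S} r(x, y).
```

Under that reading, the objective would reduce to the best scalar reward in the set, with no trade-offs left to specialize across. Whether the paper's actual training procedure behaves this way is not stated on this page.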
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate additional ablations, quantitative details, and clarifications that strengthen the presentation of VPO's mechanism and results.
Point-by-point responses
- Referee: [Section 4.2, Figure 3] The widening gap in pass@k / best@k with search budget is presented as evidence that VPO induces useful diversity. However, the experiments include no ablation that keeps the vector advantage estimator but removes the specialization objective (or vice versa), leaving open the possibility that the lift arises from altered credit assignment or variance reduction rather than from higher-entropy coverage of the solution space. This directly affects the load-bearing claim that diversity, not the estimator change itself, drives the reported improvements.
  Authors: We agree that isolating the contribution of diversity from possible changes in credit assignment or variance reduction is important for the central claim. In the revised manuscript we have added an ablation that retains the vector advantage estimator while removing the specialization objective (by collapsing to a scalar reward during training). In this control, the widening gap at higher search budgets does not appear without the specialization component. We have also added entropy statistics on the generated solutions to provide direct evidence of increased coverage of the solution space under VPO.
  Revision: yes
- Referee: [Section 4.4, evolutionary-search paragraph] The statement that VPO models 'unlock problems that GRPO models cannot solve at all' is central to the diversity argument. The manuscript should report the exact count of such problems, the precise criterion for 'cannot solve' (e.g., zero successes in M independent runs), and whether both methods used the same number of total rollouts; without these details the claim is difficult to evaluate.
  Authors: We thank the referee for this request for precision. The revised Section 4.4 now states the exact number of problems that VPO models solve and GRPO models never solve, defines 'cannot solve' as zero successes across a fixed number of independent runs, and confirms that both methods were evaluated under identical total rollout budgets. These details allow readers to evaluate the claim directly.
  Revision: yes
- Referee: [Section 3.1, Eq. (3)–(5)] The vector advantage is obtained by per-dimension normalization of the reward vector. It is not shown whether this normalization alone, applied to a scalar reward, would produce comparable gains; such a control would help separate the effect of the vector formulation from the intended diversity induction.
  Authors: We agree that this control helps separate the vector formulation from normalization effects. We have added the requested experiment, in which per-dimension normalization is applied to a scalar reward (by replicating the scalar value across dimensions). The revised manuscript reports that this scalar-normalized control does not produce gains comparable to VPO on the test-time search metrics, indicating that the vector structure enabling specialization contributes beyond normalization alone.
  Revision: yes
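The bookkeeping behind the Section 4.4 response can be sketched as follows: a problem counts as unlocked if VPO succeeds in at least one independent run and GRPO succeeds in none, with both methods given the same number of runs. The data layout (a dict from problem ID to a list of per-run success flags) and the function name are hypothetical. Equal run counts stand in for equal rollout budgets only under the assumption that every run uses the same number of rollouts.

```python
def unlocked_problems(vpo_runs, grpo_runs):
    """Return IDs of problems solved at least once by VPO and never by GRPO.

    Both arguments map problem ID -> list of booleans, one per independent run.
    Raises ValueError if the two methods were not evaluated on the same
    problems with the same number of runs.
    """
    if vpo_runs.keys() != grpo_runs.keys():
        raise ValueError("methods were evaluated on different problem sets")
    unlocked = []
    for problem_id in sorted(vpo_runs):
        vpo, grpo = vpo_runs[problem_id], grpo_runs[problem_id]
        if len(vpo) != len(grpo):
            raise ValueError(f"unequal run counts for problem {problem_id!r}")
        if any(vpo) and not any(grpo):
            unlocked.append(problem_id)
    return unlocked
```

Reporting the length of the returned list alongside the number of runs per problem would give the count and criterion the referee asks for.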
Circularity Check
VPO is an independent algorithmic proposal with empirical validation
The paper introduces Vector Policy Optimization as a drop-in replacement for the GRPO advantage estimator, explicitly training policies on vector rewards to produce specialized solutions across trade-offs. The central claims rest on empirical results across four tasks demonstrating improved pass@k, best@k, and evolutionary search performance, rather than any mathematical derivation that reduces by construction to fitted inputs, self-citations, or prior ansatzes. No load-bearing step equates a prediction to its own definition or relies on uniqueness theorems imported from the authors' prior work. The method inherits standard RL assumptions but adds independent content through the vector normalization and diversity objective, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Rewards are often vector-valued in practice, like per-test-case correctness in code generation or multiple user personas.
invented entities (1)
- Vector Policy Optimization (VPO): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: VPO replaces the fixed scalarization ... with a distribution over scalarizations. For each rollout, we sample weights $w \sim \mathrm{Dir}(\alpha)$ ... $R(S) = \mathbb{E}_{w \sim \mathrm{Dir}(\alpha)}\big[\max_{y \in S} w^{\top} r(x, y)\big]$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: VPO ... trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.