Poly-EPO: Training Exploratory Reasoning Models

Chelsea Finn; Dorsa Sadigh; Hengyuan Hu; Ifdita Hasan Orney; Jubayer Ibn Hamid; Noah Goodman; Shirley Wu; Shreya S Ramanujam

arxiv: 2604.17654 · v3 · submitted 2026-04-19 · 💻 cs.AI

Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn

Pith reviewed 2026-05-10 05:14 UTC · model grok-4.3

keywords set reinforcement learning · exploratory policy optimization · language model reasoning · generalization · pass@k coverage · response diversity · test-time compute

The pith

Poly-EPO trains language models to generate sets of accurate and diverse reasoning responses for better generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Poly-EPO, a method for post-training language models to produce several responses at once that together score well under a reward while using different reasoning approaches. It gives a general recipe for adapting reinforcement learning algorithms to this set-based setting by altering the advantage calculation. This matters because standard training often leads to repetitive or narrow reasoning, which limits how well models handle new problems or benefit from extra thinking time at test time. The reported results indicate gains in covering correct answers when sampling multiple times, more varied outputs, and better returns from extra computation.

Core claim

Poly-EPO works by optimizing language models over sets of responses using a modified reinforcement learning objective that rewards collective accuracy and exploratory strategies. The authors adapt standard RL methods through changes to advantage computation to achieve synergy between finding good solutions and trying new reasoning paths. On reasoning benchmarks this leads to models that find more correct answers within k samples, show more varied generations, and improve further when given more test-time compute.

What carries the argument

Set reinforcement learning framework with modified advantage computation that enables optimization of objectives promoting both accuracy and diversity in model outputs.

If this is right

Higher pass@k coverage on reasoning benchmarks.
Preservation of greater diversity in generated responses.
Effective scaling of performance with increased test-time compute.
Improved ability to generalize to novel problems.
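
The pass@k coverage named in the first item can be made concrete with a short sketch. This page does not say how the paper estimates it; the code below uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), computed from n samples per problem of which c are correct, and that choice is an assumption here. The names `pass_at_k` and `coverage` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Assumption: this is the common estimator 1 - C(n-c, k) / C(n, k);
    the page does not confirm which estimator the paper uses.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; each item is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging the per-problem estimate over a test set gives the kind of coverage curve that is usually plotted against k.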

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could extend to training models for tasks requiring multiple valid solutions, such as mathematical proofs or creative generation.
By promoting internal exploration, it may lower the data requirements for achieving strong performance.
Experiments comparing it directly to other exploration techniques like entropy regularization would clarify its unique benefits.

Load-bearing premise

Altering the advantage computation for set-based RL creates a genuine synergy between exploration and exploitation instead of adding biases or defaulting to standard behavior.

What would settle it

A direct comparison where Poly-EPO fails to show higher pass@k, reduced diversity loss, or better test-time scaling than baseline RL methods on the evaluated benchmarks would disprove the effectiveness of the approach.

Figures

Figures reproduced from arXiv: 2604.17654 by Chelsea Finn, Dorsa Sadigh, Hengyuan Hu, Ifdita Hasan Orney, Jubayer Ibn Hamid, Noah Goodman, Shirley Wu, Shreya S Ramanujam.

**Figure 1.** Pass@k evaluations on test sets. The x-axis is the number of attempts k used in the evaluation, and the y-axis is the coverage of the test set. The base model is Qwen3-4B-Base, and the LM-judge that Poly-EPO and GRPO+DIV use for clustering responses is Qwen-3-4B-Instruct.

**Figure 2.** Training dynamics on POLARIS-53k. Left: Average number of unique reasoning-strategy clusters among the correct generations sampled for each prompt during training. Clusters are assigned by Qwen-3-4B-Instruct, which groups generations according to their high-level reasoning strategy. Higher values indicate a greater diversity of successful reasoning approaches. Right: Fraction of training prompts for which …

**Figure 3.** Poly-EPO promotes broader branching during reasoning generation. Top: Branching structure of 8 sampled rollouts on an AIME 2026 problem for Poly-EPO (left) and GRPO (right). Nodes represent shared-prefix clusters, and edges denote branching into distinct continuations. Bottom: Average number of active branches as a function of token position on AIME 2026, BeyondAIME, and Minerva. Poly-EPO consistently main…
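
One way to read "active branches as a function of token position" is sketched below. The caption is truncated and does not define the measure, so the definition used here is an assumption: the number of distinct token prefixes among the rollouts that are still generating at a given position. Exact token matching stands in for the shared-prefix clustering in the figure, and the function names are hypothetical.

```python
def active_branches(rollouts: list[list[int]], position: int) -> int:
    """Count distinct prefixes of length position + 1.

    Assumption: a branch is "active" at a position if at least one rollout
    is long enough to reach it; shorter rollouts are ignored.
    """
    prefixes = {tuple(r[: position + 1]) for r in rollouts if len(r) > position}
    return len(prefixes)


def branching_profile(rollouts: list[list[int]]) -> list[int]:
    """Active-branch count at every token position up to the longest rollout."""
    max_len = max((len(r) for r in rollouts), default=0)
    return [active_branches(rollouts, t) for t in range(max_len)]
```

Under this reading, a policy whose samples diverge early and stay apart shows a profile that rises quickly, while near-identical samples keep the count close to one.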

**Figure 4.** Majority@k evaluations on test sets. The x-axis denotes the number of sampled generations k used at test time. Top row: pass rate after selecting the answer returned by majority voting over the k samples. Bottom row: majority vote share, i.e., the fraction of the k votes assigned to the winning answer. Higher values indicate stronger agreement among sampled generations, while lower values indicate greater …
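
The two quantities in this caption, the majority-vote answer and its vote share, can be computed as in the following minimal sketch. It assumes final answers have already been extracted and normalized to comparable strings, which the caption does not describe; `majority_vote` and `majority_at_k` are illustrative names.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and its share of the votes."""
    if not answers:
        raise ValueError("need at least one answer")
    # Ties resolve to the answer that appeared first (Counter ordering).
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


def majority_at_k(answers: list[str], reference: str, k: int) -> bool:
    """True if majority voting over the first k answers matches the reference."""
    winner, _ = majority_vote(answers[:k])
    return winner == reference
```

A high vote share means the samples mostly agree; a lower share with the same pass rate suggests the samples disagree more while the plurality answer is still correct.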

**Figure 5.** Figure 5 shows the training dynamics of the models, including how the diversity of generations evolves over training. In particular, we measure the number of distinct clusters among correct responses for each prompt, averaged over prompts in a batch, to quantify the diversity of successful strategies learned by the policy. We also measure the number of distinct clusters among incorrect responses to quantify the div…
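
The diversity measure described in the Figure 2 and Figure 5 captions, the number of distinct clusters among correct responses averaged over the prompts in a batch, can be sketched as follows. The input layout (parallel lists of cluster ids and correctness flags per prompt) and the function names are assumptions made for illustration.

```python
def unique_correct_clusters(cluster_ids: list[int], is_correct: list[bool]) -> int:
    """Number of distinct strategy clusters among the correct responses."""
    return len({c for c, ok in zip(cluster_ids, is_correct) if ok})


def mean_unique_correct_clusters(
    batch: list[tuple[list[int], list[bool]]],
) -> float:
    """Average the per-prompt count over a batch of prompts.

    Each batch item is (cluster_ids, is_correct) for one prompt's samples.
    """
    return sum(unique_correct_clusters(c, ok) for c, ok in batch) / len(batch)
```

The same function applied to the negated flags gives the count among incorrect responses that the Figure 5 caption also mentions.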

Original abstract

Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
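
To make the idea of a set-level advantage concrete, here is a minimal sketch. It is not the paper's formula, which this page does not reproduce. Everything in it is an assumption: the set objective (mean reward times the fraction of distinct strategy clusters), the exhaustive enumeration of fixed-size subsets, and the rule that credits each response with the average score of the sets containing it minus the average over all sets.

```python
from itertools import combinations


def set_score(rewards: list[float], clusters: list[int]) -> float:
    """Hypothetical set objective: mean reward x fraction of distinct clusters."""
    n = len(rewards)
    return (sum(rewards) / n) * (len(set(clusters)) / n)


def set_advantages(
    rewards: list[float], clusters: list[int], set_size: int
) -> list[float]:
    """Per-response advantage from set-level scores (illustrative only).

    Each response is credited with the mean score of all size-`set_size`
    subsets that contain it, minus the mean score over all subsets.
    """
    n_total = len(rewards)
    if not 1 <= set_size <= n_total:
        raise ValueError("set_size must be between 1 and len(rewards)")
    totals = [0.0] * n_total
    counts = [0] * n_total
    all_scores = []
    for idx in combinations(range(n_total), set_size):
        score = set_score([rewards[i] for i in idx], [clusters[i] for i in idx])
        all_scores.append(score)
        for i in idx:
            totals[i] += score
            counts[i] += 1
    baseline = sum(all_scores) / len(all_scores)
    return [totals[i] / counts[i] - baseline for i in range(n_total)]
```

With rewards `[1, 1, 1, 0]` and clusters `[0, 0, 1, 2]` at `set_size=2`, the correct response that uses its own strategy receives a higher advantage than the two correct responses that share one, which is the kind of pressure toward accurate and varied sets that the abstract describes. Enumerating all subsets is only practical for small sample counts.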

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Poly-EPO adapts set RL via a modified advantage to push diverse reasoning sets in LMs, but the abstract leaves the modification's guarantees unshown.

The paper's main contribution is a post-training recipe called Poly-EPO that trains language models to output sets of responses. The sets are optimized to be collectively accurate while varying in reasoning strategies, using a set-based RL setup that modifies how advantages are computed from standard algorithms. This is meant to improve generalization on reasoning benchmarks, raise pass@k coverage, keep generation diversity higher, and let performance scale better with extra test-time samples.

Referee Report

3 major / 2 minor

Summary. The paper introduces Poly-EPO, a post-training framework for language models that adapts standard RL algorithms to a set RL setting via a modification to advantage computation. The central claim is that this produces an objective encouraging sets of responses that are collectively accurate and exploratory in reasoning strategies, yielding improved generalization (higher pass@k coverage), preserved diversity, and better scaling with test-time compute on reasoning benchmarks.

Significance. If the unshown advantage modification and the reported empirical gains hold under rigorous verification, the work could offer a useful direction for explicitly balancing exploration and exploitation in LM post-training, particularly for reasoning tasks where test-time scaling and diversity matter. The set-level perspective is a potentially load-bearing idea for avoiding mode collapse in standard RL.

major comments (3)

[§3.1] Set RL recipe: The adaptation of standard RL algorithms via modification to the advantage computation is described at a high level but supplies no explicit formula, derivation, fixed-point analysis, or guarantee that the modified advantage preserves set-wise exploration properties for arbitrary reward functions rather than reducing to per-response advantages or introducing hidden biases.
[§5] Experiments: Claims of higher pass@k coverage, greater diversity, and effective test-time scaling are stated without quantitative values, error bars, baseline comparisons to standard RL or PPO variants, or ablations isolating the contribution of the Poly-EPO objective versus increased sampling volume alone.
[§4] Poly-EPO objective: The instantiation of the set RL framework with an objective that 'explicitly synergizes exploration and exploitation' lacks analysis showing it does not collapse to standard RL behavior or favor high-reward modes within sets, which is load-bearing for the generalization and diversity claims.

minor comments (2)

[Abstract] The abstract and introduction could clarify the precise definition of 'Polychromic' and its relation to the set-level accuracy term.
[§2] Notation for set-level versus individual-response rewards and advantages would benefit from an explicit table or running example to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments point by point below.

Referee: [§3.1] Set RL recipe: The adaptation of standard RL algorithms via modification to the advantage computation is described at a high level but supplies no explicit formula, derivation, fixed-point analysis, or guarantee that the modified advantage preserves set-wise exploration properties for arbitrary reward functions rather than reducing to per-response advantages or introducing hidden biases.

Authors: We agree with the referee that the description in Section 3.1 would benefit from greater formality. In the revised manuscript, we will provide the explicit mathematical formula for the modified advantage computation used in the set RL recipe. We will also include a derivation showing how this modification arises from the set-level objective and a brief analysis of its fixed points and behavior for general reward functions, demonstrating that it does not simply reduce to per-response advantages. revision: yes

Referee: [§5] Experiments: Claims of higher pass@k coverage, greater diversity, and effective test-time scaling are stated without quantitative values, error bars, baseline comparisons to standard RL or PPO variants, or ablations isolating the contribution of the Poly-EPO objective versus increased sampling volume alone.

Authors: The referee correctly notes that the experimental claims require more rigorous quantitative support. We will revise Section 5 to include specific numerical results for pass@k improvements, diversity metrics (such as distinct n-gram ratios or entropy), and test-time scaling curves, along with error bars computed over multiple random seeds. We will also add direct comparisons to standard RL and PPO baselines, as well as ablations that control for sampling volume to isolate the contribution of the Poly-EPO objective. revision: yes

Referee: [§4] Poly-EPO objective: The instantiation of the set RL framework with an objective that 'explicitly synergizes exploration and exploitation' lacks analysis showing it does not collapse to standard RL behavior or favor high-reward modes within sets, which is load-bearing for the generalization and diversity claims.

Authors: We acknowledge that additional analysis is needed to support the claim that the Poly-EPO objective maintains exploration within sets. In the revision, we will augment Section 4 with a theoretical discussion or empirical study showing that the objective encourages diverse reasoning strategies rather than collapsing to high-reward modes. This will include comparisons of intra-set diversity metrics between Poly-EPO and standard RL training. revision: yes
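
The rebuttal names distinct n-gram ratios as one candidate diversity metric. A minimal sketch of that metric follows, assuming whitespace tokenization and pooling n-grams across all responses to one prompt. Both choices are assumptions, and the simulated rebuttal does not say how the authors would compute it.

```python
def distinct_n(responses: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a set of responses.

    Assumption: tokens are whitespace-separated; a real evaluation would
    more likely use the model's tokenizer.
    """
    total = 0
    unique: set[tuple[str, ...]] = set()
    for text in responses:
        tokens = text.split()
        grams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0
```

Values near 1 mean the responses share few n-grams; values near 0 mean heavy repetition. The metric is surface-level, so it can differ from a strategy-level measure such as cluster counts.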

Circularity Check

0 steps flagged

No circularity: claims rest on empirical results, not self-referential derivations

full rationale

The paper describes developing a general set-RL recipe via modification to advantage computation and instantiating it as Poly-EPO, but the provided text contains no equations, fixed-point analyses, or derivations that reduce the claimed synergy or generalization improvements to fitted parameters, self-definitions, or prior self-citations. Central claims are supported by benchmark evaluations (pass@k, diversity, test-time scaling) rather than any closed loop where outputs equal inputs by construction, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes that standard RL algorithms can be adapted via advantage modification without introducing new unstated assumptions about reward structure or policy optimization.

pith-pipeline@v0.9.0 · 5498 in / 1020 out tokens · 32812 ms · 2026-05-10T05:14:46.913472+00:00 · methodology

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] Seed1.5-thinking: Advancing superb reasoning models with reinforcement learning

arXiv:2504.13914 [cs.CL]. url: https://arxiv.org/abs/2504.13914.
[SCS+21] Y. Seo, L. Chen, J. Shin, H. Lee, P. Abbeel, and K. Lee. State Entropy Maximization with Random Encoders for Efficient Exploration. 2021. arXiv:2102.09430 [cs.LG]. url: https://arxiv.org/abs/2102.09430.
[SKM25] Y. Song, J. Kempe, and R. Munos. Outcome-based Exploration for LLM Reasoning. ...

[2] A "Context" describing the task

[3] Each response contains a reasoning process and final answer

A numbered list of Responses from 1 to {n_responses}. Each response contains a reasoning process and final answer. Note: Responses may or may not explicitly state their strategy; you must infer the strategy by analyzing the mathematical steps taken. **CLUSTERING CRITERIA:** (1) Macro-strategy: The overall conceptual framework (e.g., recursion vs infinite ...

[4] 1", "2", ...,

The JSON must contain exactly {n_responses} keys: "1", "2", ..., "{n_responses}"

[5] chain_of_thought

The value for each key must be: "chain_of_thought": "Macro: [short description]. Micro: [short description]." "cluster_id": integer

[6] **Few-Shot Example 1:** **Context:** What is the smallest value of x such that |5x - 1| = |3x + 2|? **Responses:**

chain_of_thought must be concise and avoid repeating the actual calculations. **Few-Shot Example 1:** **Context:** What is the smallest value of x such that |5x - 1| = |3x + 2|? **Responses:**

[7] Solving the first gives 2x = 3 so x = 1.5

We can split this into two cases: 5x - 1 = 3x + 2 or 5x - 1 = -(3x + 2). Solving the first gives 2x = 3 so x = 1.5. The second gives 8x = -1 so x = -0.125

[8] For x < -2/3, we have -(5x-1) = -(3x+2)

The expression 5x-1 changes sign at 1/5, and 3x+2 changes at -2/3. For x < -2/3, we have -(5x-1) = -(3x+2). For -2/3 < x < 1/5, we have -(5x-1) = 3x+2. Solve -(5x - 1) = 3x + 2 for the range, yielding x = -0.125

[9] So the answer is x = -0.125

Using the property that |a|=|b| implies a=b or a=-b, we get 5x-1 = 3x+2 (x=1.5) and 5x-1 = -3x-2 (x=-0.125). So the answer is x = -0.125

[10] This expands to 25x^2 - 10x + 1 = 9x^2 + 12x + 4

To get rid of the absolute values, square both sides: (5x - 1)^2 = (3x + 2)^2. This expands to 25x^2 - 10x + 1 = 9x^2 + 12x + 4. Solve 16x^2 - 22x - 3 = 0. So, x = -1/8, 3/2. Final answer is x = -1/8

[11] This leads to x = 3/2 and x = 0

Either 5x - 1 = 3x + 2 or 5x - 1 = -3x - 2. This leads to x = 3/2 and x = 0. So, final answer is x = 0

[12] I think the answer is probably 0 or maybe 1.5

[13] 1": {{"chain_of_thought

asdf qwer zxcv 9999 ---- ??? Let's write Python to check each x from -10 to 10: `if abs(5*x-1) == abs(3*x+2): print(x)`. The answer is -0.125. **Expected Output:** {{ "1": {{"chain_of_thought": "Macro: Algebraic casework. Micro: Direct +- case split to remove absolute values.", "cluster_id": 1}}, "2": {{"chain_of_thought": "Macro: Interval analysis. Micro...

[14] 96 = 2^5 * 3^1

72 = 2^3 * 3^2. 96 = 2^5 * 3^1. To find the LCM, we take the highest power of each prime factor present: 2^5 * 3^2 = 32 * 9 = 288

[15] Prime factors of 96: 2, 2, 2, 2, 2, 3

Prime factors of 72: 2, 2, 2, 3, 3. Prime factors of 96: 2, 2, 2, 2, 2, 3. The union of these sets is five 2s and two 3s. Total: 276

[16] First find the GCD using the Euclidean algorithm: 96 = 72(1) + 24; 72 = 24(3) + 0. GCD is

[17] LCM is (72 * 96) / 24

[18] 1": {{"chain_of_thought

72 = 8*9, 96 = 8*12. The answer is 288. The answer is 288. The answer is 288. The answer is 288. **Expected Output:** {{ "1": {{"chain_of_thought": "Macro: Prime factorization analysis. Micro: LCM via maximum exponents of prime factors.", "cluster_id": 1}}, "2": {{"chain_of_thought": "Macro: Prime factorization analysis. Micro: LCM via maximum exponents o...

[19] cluster 100

<response 2> ... {n_responses}. <response {n_responses}> During the evaluation phase, the static instruction block and the dynamically instantiated suffix are concatenated into a unified user message and passed to the LM-judge for inference. The LM-judge is used to cluster all N responses, y_1, ..., y_N, sampled conditioned on a prompt. This gives us a cluster assignment, C(...
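
The extracted LM-judge prompt fragments (entries [4] and [5]) describe a JSON output with one key per response, each holding a `chain_of_thought` string and an integer `cluster_id`. A parser for that output might look like the sketch below. The doubled braces in the fragments look like format-string escapes, so the code assumes the judge emits ordinary single-brace JSON. The function name and the strictness of the checks are also assumptions.

```python
import json


def parse_cluster_output(raw: str, n_responses: int) -> list[int]:
    """Validate the judge's JSON and return cluster ids in response order.

    Raises ValueError if keys are not exactly "1".."n_responses" or if an
    entry lacks a string chain_of_thought or an integer cluster_id.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    expected = {str(i) for i in range(1, n_responses + 1)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError("keys must be exactly '1' through str(n_responses)")
    clusters = []
    for i in range(1, n_responses + 1):
        entry = data[str(i)]
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} is not an object")
        cid = entry.get("cluster_id")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(cid, int) or isinstance(cid, bool):
            raise ValueError(f"entry {i} has a non-integer cluster_id")
        if not isinstance(entry.get("chain_of_thought"), str):
            raise ValueError(f"entry {i} has no chain_of_thought string")
        clusters.append(cid)
    return clusters
```

Rejecting malformed output outright, rather than repairing it, keeps bad cluster assignments from silently entering a diversity computation. How the paper handles judge failures is not stated in these fragments.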
