Hint-Guided Diversified Policy Optimization for LLM Reasoning

Can Ye; Kaixin Wu; Mingjie Zhong; Peifeng Li; Qiaoming Zhu; Xiaobo Li; Zhiyu Cao

arxiv: 2606.03021 · v1 · pith:PWK3NMHMnew · submitted 2026-06-02 · 💻 cs.CL

Hint-Guided Diversified Policy Optimization for LLM Reasoning

Zhiyu Cao , Kaixin Wu , Mingjie Zhong , Peifeng Li , Xiaobo Li , Can Ye , Qiaoming Zhu This is my paper

Pith reviewed 2026-06-28 10:52 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM reasoningpolicy optimizationreinforcement learningdiversified solutionshint-guided trainingpropose-select-thinkRLVR

0 comments

The pith

LLMs improve reasoning by first listing multiple solution outlines as hints then selecting the most reliable one to expand.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard reinforcement learning for LLMs rewards only final answer correctness and therefore fails to push models toward considering multiple approaches. HDPO instead trains the model in two stages to follow a propose-select-think path: it first outputs several candidate solution outlines, then chooses the strongest one before continuing. The authors argue this produces both greater variety among the candidates generated and stronger ability to pick the reliable path, which together raise final answer accuracy. A reader would care because the change directly mimics the human habit of weighing options rather than committing to the first workable idea.

Core claim

HDPO lets the model first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. The method runs through Cold Start for Structured Reasoning followed by Hint-Guided Diversified Reinforcement Learning so the model learns to generate diverse and reliable solutions along the propose-select-think trajectory. Experiments demonstrate that this raises overall reasoning performance while increasing both the diversity of candidate solutions and the model's skill at identifying which ones are reliable.

What carries the argument

The propose-select-think trajectory in which the model outputs multiple candidate solution outlines as hints before choosing one to reason from.

If this is right

Models produce a wider range of candidate solutions during reasoning.
The model becomes more accurate at identifying which generated solutions are reliable.
Final answer correctness rises on the reasoning tasks tested.
Training occurs in two explicit stages that first enforce structured output then add diversified reinforcement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hint-and-select pattern could be inserted at intermediate reasoning steps for multi-hop problems.
If listing hints adds inference cost, the method would need an explicit compute-accuracy trade-off curve to decide when it is worthwhile.
The approach might transfer to domains where the cost of exploring wrong paths is high, such as code generation or theorem proving.

Load-bearing premise

Training the model to output and then select among multiple solution outlines will raise final correctness without the selection step adding new errors or requiring substantially more inference compute.

What would settle it

A head-to-head run on a standard reasoning benchmark in which HDPO models show neither higher final-answer accuracy nor measurably greater solution diversity than ordinary RLVR baselines.

Figures

Figures reproduced from arXiv: 2606.03021 by Can Ye, Kaixin Wu, Mingjie Zhong, Peifeng Li, Qiaoming Zhu, Xiaobo Li, Zhiyu Cao.

**Figure 1.** Figure 1: Hit rate of HDPO and GRPO on OlympiadBench under different number of attempts. training paradigm. Within this framework, policy optimization algorithms are widely adopted to fine-tune LLMs, with Group Relative Policy Optimization (GRPO) (Shao et al., 2024) serving as a prominent instantiation of this approach. To ensure the credibility of complex reasoning, outcome-level rewards (Cobbe et al., 2021; Sha… view at source ↗

**Figure 2.** Figure 2: The comparison of HDPO’s “propose-select-think” reasoning process ( [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of Hint-Guided Diversified Policy Optimization, comprising two stages: (1) [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The effect of diversity reward configurations [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: The effect of different weights of diversity [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: The prompt used in structured reasoning. [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: The prompt used for solution relevance judgment. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: The output produced by Qwen2.5-Math-7B following GRPO training with standard chain-of-thought [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: The output produced by Qwen2.5-Math-7B following HDPO training using “propose-select-think” [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: The output produced by Qwen2.5-Math-7B following GRPO training with standard chain-of-thought [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

**Figure 11.** Figure 11: The output produced by Qwen2.5-Math-7B following HDPO training using “propose-select-think” [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: A failure case of Qwen2.5-Math-7B following HDPO training, where the candidate solutions did not [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

read the original abstract

Recent developments in Large Language Models (LLMs) have showcased impressive reasoning capabilities, with Reinforcement Learning with Verifiable Rewards (RLVR) being a promising enhancement strategy. However, existing reward mechanisms are constrained to the outcome-level correctness and lack explicit signals to guide the model to consider diverse solutions. In contrast, human problem solving typically involves evaluating multiple potential approaches and selecting the most reliable solution, a cognitive process that current RLVR frameworks do not explicitly incentivize. Inspired by this, we propose Hint-Guided Diversified Policy Optimization (HDPO), allowing the model to first list all potential candidate solution outlines as hints and then select the most reliable one for further reasoning. HDPO comprises two stages of Cold Start for Structured Reasoning and Hint-Guided Diversified Reinforcement Learning to incentivize the model to generate diverse and reliable solutions following the ``propose-select-think'' trajectory. Experimental results show that HDPO effectively boosts LLM reasoning and enhances the diversity of candidate solutions as well as the LLM's ability to identify reliable solutions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HDPO layers a propose-select-think loop onto RLVR but the abstract supplies no metrics or compute controls, so the claimed gains cannot be checked.

read the letter

The paper's core move is to train an LLM to first emit multiple candidate solution outlines as hints, then pick one before doing the actual reasoning. It splits this into a cold-start phase for structured output followed by hint-guided diversified RL. That explicit trajectory is the concrete addition to existing RLVR setups.

The idea itself is reasonable. Human solvers often sketch several approaches before committing, and current outcome-only rewards do not push models to practice that step. Framing the training around propose-select-think gives a clear signal for both diversity and selection.

The evidence is the main gap. The abstract states that HDPO improves reasoning, solution diversity, and selection reliability, yet it reports no accuracy numbers, no baseline comparisons, no datasets, and no ablations. The stress-test concern about inference compute holds up on the given description: generating multiple hints multiplies tokens before selection, so any reported lift could come from extra sampling rather than the learned selection mechanism. Without matched budgets or separate error analysis on the selection step, the results stay uncheckable.

This is aimed at groups already running RLVR-style training for reasoning models. A reader who wants to try adding an explicit diversity stage might borrow the trajectory even if the current numbers are missing.

It deserves a serious referee. The method is stated plainly enough that reviewers can ask the right questions about controls and measurements once the full experiments are in front of them.

I would send it to peer review.

Referee Report

2 major / 0 minor

Summary. The paper proposes Hint-Guided Diversified Policy Optimization (HDPO) as an enhancement to Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning. HDPO uses a two-stage process—Cold Start for Structured Reasoning followed by Hint-Guided Diversified Reinforcement Learning—to train models on a 'propose-select-think' trajectory: the model first generates multiple candidate solution outlines as hints, selects the most reliable one, and then performs further reasoning. The central claim is that this yields improved reasoning performance, greater diversity in candidate solutions, and better ability to identify reliable solutions, as demonstrated by experimental results.

Significance. If the experimental claims hold after proper controls, the work would address a genuine gap in outcome-only RLVR by explicitly incentivizing diversity and selection, potentially improving robustness in LLM reasoning without relying solely on outcome correctness. The two-stage training and propose-select structure are concrete and could be adopted more broadly if shown to outperform compute-matched baselines.

major comments (2)

[Abstract] Abstract: The abstract asserts that 'experimental results show that HDPO effectively boosts LLM reasoning' and enhances diversity and selection ability, but supplies no metrics, baselines, datasets, ablation details, or statistical tests. This makes the central empirical claim impossible to evaluate against the data.
[Experimental evaluation] Experimental evaluation (implied by the abstract's results claim): No indication is given that baseline RLVR runs were evaluated under an equivalent total token or sample budget (e.g., via repeated sampling or self-consistency at matched compute). Without this control, reported gains in final-answer correctness could be explained by the implicit increase in exploration from generating multiple hints rather than by the learned selection mechanism or diversity incentive.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point by point below.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract asserts that 'experimental results show that HDPO effectively boosts LLM reasoning' and enhances diversity and selection ability, but supplies no metrics, baselines, datasets, ablation details, or statistical tests. This makes the central empirical claim impossible to evaluate against the data.

Authors: We agree that the abstract lacks sufficient detail for evaluation. In revision we will expand it to report key accuracy improvements (e.g., on MATH/GSM8K), the main RLVR baseline, primary datasets, and a concise reference to the diversity and selection ablations, while respecting length limits. revision: yes
Referee: [Experimental evaluation] Experimental evaluation (implied by the abstract's results claim): No indication is given that baseline RLVR runs were evaluated under an equivalent total token or sample budget (e.g., via repeated sampling or self-consistency at matched compute). Without this control, reported gains in final-answer correctness could be explained by the implicit increase in exploration from generating multiple hints rather than by the learned selection mechanism or diversity incentive.

Authors: We acknowledge the need for explicit compute-matched controls. Our training-stage sample counts were matched, but inference-time token budgets for hint generation were not explicitly equalized against self-consistency baselines. We will add new evaluation experiments using repeated sampling and self-consistency at matched total token budgets and report the results in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity: procedural method with no derivations or self-referential fits

full rationale

The paper describes a two-stage training procedure (Cold Start for Structured Reasoning followed by Hint-Guided Diversified RL) at the level of high-level stages and trajectories without any equations, fitted parameters, or mathematical derivations. No load-bearing steps reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. Experimental claims rest on reported results rather than closed-loop constructions, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described or can be inferred.

pith-pipeline@v0.9.1-grok · 5721 in / 1074 out tokens · 21952 ms · 2026-06-28T10:52:59.559386+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 6 canonical work pages · 5 internal anchors

[1]

Training language models to follow instruc- tions with human feedback. InAdvances in Neural Information Processing Systems 35: Annual Confer- ence on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Di-...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599. Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. 2024a. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. InForty-first International ...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

Learning to Reason under Off-Policy Guidance

Text2reward: Reward shaping with language models for reinforcement learning. InThe Twelfth 10 International Conference on Learning Representa- tions, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. 2025. Learning to reason under off-policy guidance.arXiv p...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, and 1 others. 2024. Qwen2. 5-math technical report: Toward mathe- matical expert model via self-improvement.arXiv preprint arXiv:2409.12122. Shunyu Yao, Dian Yu, Jeffrey Zhao,...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[5]

Tree of thoughts: Deliberate problem solving with large language models. InAdvances in Neural Information Processing Systems 36: Annual Confer- ence on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaoh...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Yes” as a measure of the similarity between the two candi- date solutions. As shown by “HDPO (LLM-Div)

OpenReview.net. 11 Methods AIME 25 Math Olympiad GPQA HDPO (0.6B) 31.69 93.28 64.76 69.89 HDPO (4B) 32.12 93.85 65.54 70.26 HDPO (bge-m3) 31.24 92.73 64.48 69.31 HDPO (B&R) 29.76 92.01 62.81 67.76 HDPO (LLM-Div) 32.51 92.14 64.22 70.47 Table 6: Comparison of different diversity measurement methods. A Prompt for Structured Reasoning We provide the prompt u...

work page arXiv 2024
[7]

as representative baseline algorithms for eval- uation. As shown in Table 8, integrating HDPO into each baseline algorithm yielded consistent per- formance improvements, thereby establishing the generalizability of our approach across diverse pol- icy optimization frameworks. G Impact of Maximum Number of Candidates To enhance the diversity of solutions e...

2023
[8]

ex- plore–evaluate–select

paradigms that rely on repeated model calls, explicit tree expansion, or post-hoc vot- ing during inference, HDPO internalizes the “ex- plore–evaluate–select” cognitive cycle directly into the policy network. This design preserves the fault- tolerance benefits of multi-path reasoning while eliminating the linear-to-exponential latency over- head inherent ...

2025
[9]

It should only elaborate on the high-level strategies and concepts, without going into specific calculations

The overview of the solution should be as concise as possible, containing 1 to 5 sentences. It should only elaborate on the high-level strategies and concepts, without going into specific calculations
[10]

Different candidate solutions should exist differences at the method level, so do not output similar solution overviews
[11]

Do not use line breaks within the same candidate solution overview, and use “\n\n” to separate different candidate solutions
[12]

[1]”, “[2]

You can propose 1 to 5 candidate solutions, each preceded by the serial number enclosed in square brackets, such as “[1]”, “[2]”, “[3]”, etc. Each candidate solution overview is a core summary of the proposed solution. After thinking through the candidate solution overviews, choose the most reliable one and place the step number of the chosen solution (e....
[13]

They use the same branch of mathematics or specific theorem
[14]

Yes", otherwise output

They rely on the same underlying logical dependencies, although the expressions may differ. If the two candidate solutions are the same, output "Yes", otherwise output "No", do not output redundant content. Question: {question} Candidate Solution 1: {candidate solution 1} Candidate Solution 2: {candidate solution 2} Figure 7: The prompt used for solution ...
[15]

Atx= 0, we have|c| ≤1
[16]

Atx= 1, we have|a+b+c| ≤1
[17]

We need to find the values ofa,b, andcthat maximize|a|+|b|+|c|while satisfying these constraints

Atx= 1 2, we have| a 4 + b 2 +c| ≤1. We need to find the values ofa,b, andcthat maximize|a|+|b|+|c|while satisfying these constraints. Let’s consider the possible values forc. Since|c| ≤1, we can setc= 1orc=−1 to maximize the sum. Let’s start withc= 1. Then we have: 1.|a+b+ 1| ≤1, which implies−2≤a+b≤0. 2.| a 4 + b 2 + 1| ≤1, which implies−2≤ a 4 + b 2 + ...
[18]

Then express a, b, c in terms of these values and use linear programming or symmetry arguments to maximize |a| + |b| + |c|

Use the fact that the quadratic polynomial is bounded by 1 in absolute value on [0,1], and evaluate it at key points such as x=0, x=1, and x=0.5 to derive constraints on a, b, c. Then express a, b, c in terms of these values and use linear programming or symmetry arguments to maximize |a| + |b| + |c|
[19]

Apply the method of Lagrange multipliers to maximize the linear functional |a| + |b| + |c| subject to the quadratic constraint

Consider the polynomial as a function in a finite-dimensional space and use the extreme values of the polynomial on [0,1] to form a system of inequalities. Apply the method of Lagrange multipliers to maximize the linear functional |a| + |b| + |c| subject to the quadratic constraint
[20]

However, this may miss the global maximum if the optimal polynomial is not symmetric or has non-zero a and b

Assume symmetry or specific forms (e.g., a=0, or b=0) to reduce complexity and test boundary cases. However, this may miss the global maximum if the optimal polynomial is not symmetric or has non-zero a and b
[21]

Scale and shift the Chebyshev polynomial to satisfy the bound|P(x)| ≤1and compute the coefficients to find the maximum of |a| + |b| + |c|

Use Chebyshev polynomials, which are known to be extremal with respect to maximum deviation from zero, and map the standard Chebyshev polynomial on [-1,1] to the interval [0,1]. Scale and shift the Chebyshev polynomial to satisfy the bound|P(x)| ≤1and compute the coefficients to find the maximum of |a| + |b| + |c|
[22]

propose-select-think

Use polynomial interpolation and optimization: since the polynomial is bounded on [0,1], consider the values at three points (e.g., 0, 1/2, 1) and use the resulting system to express coefficients, then apply bounds and optimization techniques to maximize the sum of absolute values. </Candidate Solutions> <selected>[4]</selected> <thinking> We are given th...
[23]

- PointPis 4 units away from the circle, so the distance fromPto the centerOis6 + 4 = 10

**Understand the Geometry:** - CircleOhas a radius of 6. - PointPis 4 units away from the circle, so the distance fromPto the centerOis6 + 4 = 10. - Chord ABis parallel to ← →OPand the distance between ABand ← →OPis 2
[24]

But sincePis 4 units away from the circle and AB is parallel to ← →OP, the perpendicular distance fromOto ABmust be 4 (as 8 would place AB outside the circle)

**Find the Length of Chord AB:** - Since ABis parallel to ← →OPand the distance between them is 2, the perpendicular distance fromOto ABis6−2 = 4or6 + 2 = 8. But sincePis 4 units away from the circle and AB is parallel to ← →OP, the perpendicular distance fromOto ABmust be 4 (as 8 would place AB outside the circle). - Using the Pythagorean theorem in the ...
[25]

- The chord ABis parallel to the x-axis and 2 units below the x-axis (since the distance from Oto ABis 4)

**Find the Coordinates:** - Place the circle centered atO(0,0)and pointPat(10,0). - The chord ABis parallel to the x-axis and 2 units below the x-axis (since the distance from Oto ABis 4). - So, the coordinates ofAandBcan be(−2 √ 5,−2)and(2 √ 5,−2)respectively
[26]

**CalculateP A 2 andP B 2:** - Using the distance formula, we get: P A2 = (10−(−2 √ 5))2 + (0−(−2)) 2 = (10 + 2 √ 5)2 + 22 = 100 + 40 √ 5 + 20 + 4 = 124 + 40 √ 5 P B2 = (10−2 √ 5)2 + (0−(−2)) 2 = (10−2 √ 5)2 + 22 = 100−40 √ 5 + 20 + 4 = 124−40 √ 5 - AddingP A 2 andP B 2: P A2 +P B 2 = (124 + 40 √
[27]

19 Case 2 (Generation Model: Qwen2.5-Math-7B-HDPO) Question: CircleOhas radius 6

= 248 So, the final answer is: 248 </thinking> 248 ✘Correct Answer:272 Figure 10: The output produced by Qwen2.5-Math-7B following GRPO training with standard chain-of-thought reasoning in the second test case. 19 Case 2 (Generation Model: Qwen2.5-Math-7B-HDPO) Question: CircleOhas radius 6. PointPlies outside circleO, and the shortest distance fromPto ci...
[28]

SinceABis parallel toOP and the distance between them is 2, the perpendicular distance fromOtoABis either4 + 2 = 6 or4−2 = 2

Use geometric symmetry and properties of circles and chords. SinceABis parallel toOP and the distance between them is 2, the perpendicular distance fromOtoABis either4 + 2 = 6 or4−2 = 2. This approach relies on parameterizing the circle and solving systems of equations
[29]

Use vector geometry: represent pointsAandBon the circle withABparallel to vector ⃗OP, and the perpendicular distance between the lines is 2
[30]

Since chordABis parallel to ← →OP, it is horizontal, and the distance betweenAB and ← →OPis 2, soABis either aty= 2ory=−2

Use coordinate geometry by placing the center of circleOat the origin and pointPalong the x-axis. Since chordABis parallel to ← →OP, it is horizontal, and the distance betweenAB and ← →OPis 2, soABis either aty= 2ory=−2. </Candidate Solutions> <selected>[3]</selected> <thinking> We place the center of circleOat the origin(0,0). Since the radius is 6, the ...
[31]

propose-select-think

= 136 + 136 = 272 The √ 2terms cancel out. Thus, the answer is272 </thinking> 272 ✓ Figure 11: The output produced by Qwen2.5-Math-7B following HDPO training using “propose-select-think” structured reasoning in the second test case. 20 Case 3 (Generation Model: Qwen2.5-Math-7B-HDPO) Question: Computesin 2 4◦ + sin2 8◦ + sin2 12◦ +· · ·+ sin 2 176◦. Respon...
[32]

Then apply the sum of cosine series formula for angles in arithmetic sequence, simplifying the resulting expression using symmetry and periodicity of the cosine function

Use the identitysin 2 θ= 1−cos 2θ 2 to rewrite each term in the sum. Then apply the sum of cosine series formula for angles in arithmetic sequence, simplifying the resulting expression using symmetry and periodicity of the cosine function
[33]

However, this approach lacks precision and relies on approximation, making it unsuitable for exact computation

Interpret the sum as a discrete approximation of an integral ofsin 2 xover a interval, and use numerical methods or symmetry to estimate the value. However, this approach lacks precision and relies on approximation, making it unsuitable for exact computation
[34]

Use complex exponentials to expresssin 2 θand convert the sum into a sum of powers of roots of unity, then apply properties of Gauss sums or Fourier analysis on finite groups
[35]

</Candidate Solutions> <selected>[1]</selected> <thinking> We are given the sum: sin2 4◦ + sin2 8◦ + sin2 12◦ +· · ·+ sin 2 176◦ This is a sum ofsin 2 θforθ= 4k ◦ wherek= 1,2,

Use a recursive or computational algorithm to evaluate each term numerically and sum them, but this is impractical for exact mathematical reasoning and lacks elegance and precision. </Candidate Solutions> <selected>[1]</selected> <thinking> We are given the sum: sin2 4◦ + sin2 8◦ + sin2 12◦ +· · ·+ sin 2 176◦ This is a sum ofsin 2 θforθ= 4k ◦ wherek= 1,2,...

[1] [1]

Training language models to follow instruc- tions with human feedback. InAdvances in Neural Information Processing Systems 35: Annual Confer- ence on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Di-...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi k1. 5: Scaling reinforcement learning with llms.arXiv preprint arXiv:2501.12599. Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R. Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. 2024a. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. InForty-first International ...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Learning to Reason under Off-Policy Guidance

Text2reward: Reward shaping with language models for reinforcement learning. InThe Twelfth 10 International Conference on Learning Representa- tions, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, and Yue Zhang. 2025. Learning to reason under off-policy guidance.arXiv p...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, and 1 others. 2024. Qwen2. 5-math technical report: Toward mathe- matical expert model via self-improvement.arXiv preprint arXiv:2409.12122. Shunyu Yao, Dian Yu, Jeffrey Zhao,...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Tree of thoughts: Deliberate problem solving with large language models. InAdvances in Neural Information Processing Systems 36: Annual Confer- ence on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaoh...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[6] [6]

Yes” as a measure of the similarity between the two candi- date solutions. As shown by “HDPO (LLM-Div)

OpenReview.net. 11 Methods AIME 25 Math Olympiad GPQA HDPO (0.6B) 31.69 93.28 64.76 69.89 HDPO (4B) 32.12 93.85 65.54 70.26 HDPO (bge-m3) 31.24 92.73 64.48 69.31 HDPO (B&R) 29.76 92.01 62.81 67.76 HDPO (LLM-Div) 32.51 92.14 64.22 70.47 Table 6: Comparison of different diversity measurement methods. A Prompt for Structured Reasoning We provide the prompt u...

work page arXiv 2024

[7] [7]

as representative baseline algorithms for eval- uation. As shown in Table 8, integrating HDPO into each baseline algorithm yielded consistent per- formance improvements, thereby establishing the generalizability of our approach across diverse pol- icy optimization frameworks. G Impact of Maximum Number of Candidates To enhance the diversity of solutions e...

2023

[8] [8]

ex- plore–evaluate–select

paradigms that rely on repeated model calls, explicit tree expansion, or post-hoc vot- ing during inference, HDPO internalizes the “ex- plore–evaluate–select” cognitive cycle directly into the policy network. This design preserves the fault- tolerance benefits of multi-path reasoning while eliminating the linear-to-exponential latency over- head inherent ...

2025

[9] [9]

It should only elaborate on the high-level strategies and concepts, without going into specific calculations

The overview of the solution should be as concise as possible, containing 1 to 5 sentences. It should only elaborate on the high-level strategies and concepts, without going into specific calculations

[10] [10]

Different candidate solutions should exist differences at the method level, so do not output similar solution overviews

[11] [11]

Do not use line breaks within the same candidate solution overview, and use “\n\n” to separate different candidate solutions

[12] [12]

[1]”, “[2]

You can propose 1 to 5 candidate solutions, each preceded by the serial number enclosed in square brackets, such as “[1]”, “[2]”, “[3]”, etc. Each candidate solution overview is a core summary of the proposed solution. After thinking through the candidate solution overviews, choose the most reliable one and place the step number of the chosen solution (e....

[13] [13]

They use the same branch of mathematics or specific theorem

[14] [14]

Yes", otherwise output

They rely on the same underlying logical dependencies, although the expressions may differ. If the two candidate solutions are the same, output "Yes", otherwise output "No", do not output redundant content. Question: {question} Candidate Solution 1: {candidate solution 1} Candidate Solution 2: {candidate solution 2} Figure 7: The prompt used for solution ...

[15] [15]

Atx= 0, we have|c| ≤1

[16] [16]

Atx= 1, we have|a+b+c| ≤1

[17] [17]

We need to find the values ofa,b, andcthat maximize|a|+|b|+|c|while satisfying these constraints

Atx= 1 2, we have| a 4 + b 2 +c| ≤1. We need to find the values ofa,b, andcthat maximize|a|+|b|+|c|while satisfying these constraints. Let’s consider the possible values forc. Since|c| ≤1, we can setc= 1orc=−1 to maximize the sum. Let’s start withc= 1. Then we have: 1.|a+b+ 1| ≤1, which implies−2≤a+b≤0. 2.| a 4 + b 2 + 1| ≤1, which implies−2≤ a 4 + b 2 + ...

[18] [18]

Then express a, b, c in terms of these values and use linear programming or symmetry arguments to maximize |a| + |b| + |c|

Use the fact that the quadratic polynomial is bounded by 1 in absolute value on [0,1], and evaluate it at key points such as x=0, x=1, and x=0.5 to derive constraints on a, b, c. Then express a, b, c in terms of these values and use linear programming or symmetry arguments to maximize |a| + |b| + |c|

[19] [19]

Apply the method of Lagrange multipliers to maximize the linear functional |a| + |b| + |c| subject to the quadratic constraint

Consider the polynomial as a function in a finite-dimensional space and use the extreme values of the polynomial on [0,1] to form a system of inequalities. Apply the method of Lagrange multipliers to maximize the linear functional |a| + |b| + |c| subject to the quadratic constraint

[20] [20]

However, this may miss the global maximum if the optimal polynomial is not symmetric or has non-zero a and b

Assume symmetry or specific forms (e.g., a=0, or b=0) to reduce complexity and test boundary cases. However, this may miss the global maximum if the optimal polynomial is not symmetric or has non-zero a and b

[21] [21]

Scale and shift the Chebyshev polynomial to satisfy the bound|P(x)| ≤1and compute the coefficients to find the maximum of |a| + |b| + |c|

Use Chebyshev polynomials, which are known to be extremal with respect to maximum deviation from zero, and map the standard Chebyshev polynomial on [-1,1] to the interval [0,1]. Scale and shift the Chebyshev polynomial to satisfy the bound|P(x)| ≤1and compute the coefficients to find the maximum of |a| + |b| + |c|

[22] [22]

propose-select-think

Use polynomial interpolation and optimization: since the polynomial is bounded on [0,1], consider the values at three points (e.g., 0, 1/2, 1) and use the resulting system to express coefficients, then apply bounds and optimization techniques to maximize the sum of absolute values. </Candidate Solutions> <selected>[4]</selected> <thinking> We are given th...

[23] [23]

- PointPis 4 units away from the circle, so the distance fromPto the centerOis6 + 4 = 10

**Understand the Geometry:** - CircleOhas a radius of 6. - PointPis 4 units away from the circle, so the distance fromPto the centerOis6 + 4 = 10. - Chord ABis parallel to ← →OPand the distance between ABand ← →OPis 2

[24] [24]

But sincePis 4 units away from the circle and AB is parallel to ← →OP, the perpendicular distance fromOto ABmust be 4 (as 8 would place AB outside the circle)

**Find the Length of Chord AB:** - Since ABis parallel to ← →OPand the distance between them is 2, the perpendicular distance fromOto ABis6−2 = 4or6 + 2 = 8. But sincePis 4 units away from the circle and AB is parallel to ← →OP, the perpendicular distance fromOto ABmust be 4 (as 8 would place AB outside the circle). - Using the Pythagorean theorem in the ...

[25] [25]

- The chord ABis parallel to the x-axis and 2 units below the x-axis (since the distance from Oto ABis 4)

**Find the Coordinates:** - Place the circle centered atO(0,0)and pointPat(10,0). - The chord ABis parallel to the x-axis and 2 units below the x-axis (since the distance from Oto ABis 4). - So, the coordinates ofAandBcan be(−2 √ 5,−2)and(2 √ 5,−2)respectively

[26] [26]

**CalculateP A 2 andP B 2:** - Using the distance formula, we get: P A2 = (10−(−2 √ 5))2 + (0−(−2)) 2 = (10 + 2 √ 5)2 + 22 = 100 + 40 √ 5 + 20 + 4 = 124 + 40 √ 5 P B2 = (10−2 √ 5)2 + (0−(−2)) 2 = (10−2 √ 5)2 + 22 = 100−40 √ 5 + 20 + 4 = 124−40 √ 5 - AddingP A 2 andP B 2: P A2 +P B 2 = (124 + 40 √

[27] [27]

19 Case 2 (Generation Model: Qwen2.5-Math-7B-HDPO) Question: CircleOhas radius 6

= 248 So, the final answer is: 248 </thinking> 248 ✘Correct Answer:272 Figure 10: The output produced by Qwen2.5-Math-7B following GRPO training with standard chain-of-thought reasoning in the second test case. 19 Case 2 (Generation Model: Qwen2.5-Math-7B-HDPO) Question: CircleOhas radius 6. PointPlies outside circleO, and the shortest distance fromPto ci...

[28] [28]

SinceABis parallel toOP and the distance between them is 2, the perpendicular distance fromOtoABis either4 + 2 = 6 or4−2 = 2

Use geometric symmetry and properties of circles and chords. SinceABis parallel toOP and the distance between them is 2, the perpendicular distance fromOtoABis either4 + 2 = 6 or4−2 = 2. This approach relies on parameterizing the circle and solving systems of equations

[29] [29]

Use vector geometry: represent pointsAandBon the circle withABparallel to vector ⃗OP, and the perpendicular distance between the lines is 2

[30] [30]

Since chordABis parallel to ← →OP, it is horizontal, and the distance betweenAB and ← →OPis 2, soABis either aty= 2ory=−2

Use coordinate geometry by placing the center of circleOat the origin and pointPalong the x-axis. Since chordABis parallel to ← →OP, it is horizontal, and the distance betweenAB and ← →OPis 2, soABis either aty= 2ory=−2. </Candidate Solutions> <selected>[3]</selected> <thinking> We place the center of circleOat the origin(0,0). Since the radius is 6, the ...

[31] [31]

propose-select-think

= 136 + 136 = 272 The √ 2terms cancel out. Thus, the answer is272 </thinking> 272 ✓ Figure 11: The output produced by Qwen2.5-Math-7B following HDPO training using “propose-select-think” structured reasoning in the second test case. 20 Case 3 (Generation Model: Qwen2.5-Math-7B-HDPO) Question: Computesin 2 4◦ + sin2 8◦ + sin2 12◦ +· · ·+ sin 2 176◦. Respon...

[32] [32]

Then apply the sum of cosine series formula for angles in arithmetic sequence, simplifying the resulting expression using symmetry and periodicity of the cosine function

Use the identitysin 2 θ= 1−cos 2θ 2 to rewrite each term in the sum. Then apply the sum of cosine series formula for angles in arithmetic sequence, simplifying the resulting expression using symmetry and periodicity of the cosine function

[33] [33]

However, this approach lacks precision and relies on approximation, making it unsuitable for exact computation

Interpret the sum as a discrete approximation of an integral ofsin 2 xover a interval, and use numerical methods or symmetry to estimate the value. However, this approach lacks precision and relies on approximation, making it unsuitable for exact computation

[34] [34]

Use complex exponentials to expresssin 2 θand convert the sum into a sum of powers of roots of unity, then apply properties of Gauss sums or Fourier analysis on finite groups

[35] [35]

</Candidate Solutions> <selected>[1]</selected> <thinking> We are given the sum: sin2 4◦ + sin2 8◦ + sin2 12◦ +· · ·+ sin 2 176◦ This is a sum ofsin 2 θforθ= 4k ◦ wherek= 1,2,

Use a recursive or computational algorithm to evaluate each term numerically and sum them, but this is impractical for exact mathematical reasoning and lacks elegance and precision. </Candidate Solutions> <selected>[1]</selected> <thinking> We are given the sum: sin2 4◦ + sin2 8◦ + sin2 12◦ +· · ·+ sin 2 176◦ This is a sum ofsin 2 θforθ= 4k ◦ wherek= 1,2,...