HDPO adds a propose-select-think stage to RLVR: the LLM generates diverse solution outlines as hints, selects the most reliable one, and reasons from it. The paper's experiments claim improved reasoning and solution diversity.
Yes” as a measure of the similarity between the two candidate solutions. As shown by “HDPO (LLM-Div)
1 Pith paper cites this work. Polarity classification is still being indexed.
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Hint-Guided Diversified Policy Optimization for LLM Reasoning