HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.
X t∈Gk w(k) t ˆAt · ∇θ logπ θ(ot|st) # .(24) Substituting the decomposition from Eq.(21) into Eq.(24): 17 g(k) =E
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
citation-role summary
other 1
citation-polarity summary
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1roles
other 1polarities
unclear 1representative citing papers
citing papers explorer
-
HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control
HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.