$\sum_{t \in G_k} w_t^{(k)} \hat{A}_t \cdot \nabla_\theta \log \pi_\theta(o_t \mid s_t)\Big]$. (24) Substituting the decomposition from Eq. (21) into Eq. (24): $g^{(k)} = \mathbb{E}$

The score function $\nabla_\theta \log \pi_\theta(o_t \mid s_t)$ is $L$-Lipschitz continuous in $\theta$, where we define $s_t := (q, o_{<t})$ for brevity.
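
Written out, the Lipschitz condition on the score function takes the standard form below. The Euclidean norm and the symbol $\theta'$ for a second parameter vector are assumptions here, since the excerpt does not state which norm is used.

```latex
\[
\bigl\| \nabla_\theta \log \pi_\theta(o_t \mid s_t)
      - \nabla_{\theta'} \log \pi_{\theta'}(o_t \mid s_t) \bigr\|_2
\;\le\; L \, \| \theta - \theta' \|_2
\quad \text{for all } \theta, \theta'.
\]
```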

1 Pith paper cites this work. Polarity classification is still being indexed.

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.

citing papers explorer

Showing 1 of 1 citing paper.

HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control

cs.LG · 2026-05-08 · unverdicted · none · ref 45