RubricEM uses rubric-guided stagewise policy decomposition and reflection-based meta-policy evolution to improve long-horizon research agents beyond verifiable rewards.
For π-almost every (π, π), the conditional distribution ππ(π | π, π, M) is differentiable in π.
One Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
- RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards