RubricEM uses rubric-guided stagewise policy decomposition and reflection-based meta-policy evolution to improve long-horizon research agents beyond verifiable rewards.
… $\max_{\pi \in A} \sum_{z=1}^{K} \pi(z \mid C)\,\pi(C, z, \pi)$. Then we can simply see that, by Definition 3, $\pi_{\text{stage}} - \pi_{\text{flat}} = \mathbb{E}$ …
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards