PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
Direct preference optimization: Your language model is secretly a reward model
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2representative citing papers
Develops a McDiarmid-type concentration inequality for causal autoregressive processes that preserves sparsity to achieve O(1) variance proxies instead of O(N).
citing papers explorer
-
Process Reinforcement through Implicit Rewards
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
-
Matrix-Decoupled Concentration for Autoregressive Sequences: Dimension-Free Guarantees for Sparse Long-Context Rewards
Develops a McDiarmid-type concentration inequality for causal autoregressive processes that preserves sparsity to achieve O(1) variance proxies instead of O(N).