VinePPO: Unlocking RL Potential for LLM Reasoning through Refined Credit Assignment
3 Pith papers cite this work.
Citing papers
- OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning
  OPPO derives token-level advantages for LLM RL via Bayesian recursion on oracle signals, recovering prior distillation methods as a special case and showing gains on math and code benchmarks.
- GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
  GEAR adaptively reweights GRPO advantages in LLM RL by using divergence spikes from self-distillation to define semantic segments and modulate local credit.
- Generalizing Verifiable Instruction Following
  Introduces IFBench benchmark with 58 new constraints and demonstrates RLVR training improves generalization of language models to unseen verifiable output constraints.