DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
Msrl: Scaling generative multimodal reward modeling via multi-stage reinforcement learning
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.CL 2years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.
citing papers explorer
-
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
-
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.