EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 3verdicts
UNVERDICTED 3roles
background 1polarities
background 1representative citing papers
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
Faithful GRPO cuts inconsistent chain-of-thought reasoning in visual spatial tasks from 24.5% to 1.7% while raising visual grounding scores by 13% and final answer accuracy on seven benchmarks.
citing papers explorer
-
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
-
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
-
Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
Faithful GRPO cuts inconsistent chain-of-thought reasoning in visual spatial tasks from 24.5% to 1.7% while raising visual grounding scores by 13% and final answer accuracy on seven benchmarks.