VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
Mitigating think-answer mismatch in llm reason- ing through noise-aware advantage reweighting.arXiv preprint arXiv:2508.05928, 2025
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
DGPO aggregates supervision at the group level with direction-aware multi-candidate comparisons to improve LLM alignment, delivering up to 3.6% average accuracy gains over baselines.
citing papers explorer
-
VIDEOP2R: Video Understanding from Perception to Reasoning
VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.
-
Gradient Extrapolation-Based Policy Optimization
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.
-
DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
DGPO aggregates supervision at the group level with direction-aware multi-candidate comparisons to improve LLM alignment, delivering up to 3.6% average accuracy gains over baselines.