Mitigating think-answer mismatch in LLM reasoning through noise-aware advantage reweighting. arXiv preprint arXiv:2508.05928, 2025

Si Shen, Peijun Shen, Wenhua Zhao, Danhao Zhu · 2025 · arXiv 2508.05928

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

VideoP2R: Video Understanding from Perception to Reasoning

cs.CV · 2025-11-14 · conditional · novelty 7.0

VideoP2R separates perception and reasoning in a process-aware RFT pipeline with a new CoT dataset and PA-GRPO rewards, reaching SOTA on six of seven video benchmarks.

Gradient Extrapolation-Based Policy Optimization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering up to 4x step speedup.

DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization

cs.CL · 2026-05-11 · unverdicted · novelty 4.0

DGPO aggregates supervision at the group level with direction-aware multi-candidate comparisons to improve LLM alignment, delivering up to 3.6% average accuracy gains over baselines.

citing papers explorer

VideoP2R: Video Understanding from Perception to Reasoning · cs.CV · 2025-11-14 · conditional · none · ref 44
Gradient Extrapolation-Based Policy Optimization · cs.LG · 2026-05-07 · unverdicted · none · ref 29
DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization · cs.CL · 2026-05-11 · unverdicted · none · ref 8
