In The Thirty-ninth Annual Conference on Neural Information Processing Systems

Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection, rollout replay · 2025 · arXiv 2510.07242

2 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

cs.AI · 2026-01-08 · conditional · novelty 7.0

Miner uses intrinsic policy uncertainty with token-level focal credit assignment and adaptive advantage calibration as a self-supervised reward to enable efficient RL training on positive homogeneous prompts, yielding up to 4.58 Pass@1 gains over GRPO on Qwen3 models.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

citing papers explorer

Showing 1 of 1 citing paper after filters.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

cs.LG · 2026-04-28 · unverdicted · none · ref 85
