Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
citing papers explorer
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
  dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
  LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.
- Bridging the Gap between Learning and Inference for Diffusion-Based Molecule Generation
  DiffGap introduces adaptive alignment of denoising steps and temperature annealing in diffusion models for 3D molecule generation, reporting better docking scores and binding affinity than prior methods on CrossDocked2020.