
arxiv: 2510.09388 · v2 · pith:ZSFYLQOInew · submitted 2025-10-10 · 💻 cs.LG · cs.CL

Don't Tell the Answer, Truly Guide the Reasoning During RL Rollouts

classification 💻 cs.LG cs.CL
keywords affinity hint model policy reasoning capabilities first guide

Reinforcement Learning (RL) has become a key driver for enhancing the long chain-of-thought (CoT) reasoning capabilities of Large Language Models (LLMs). However, prevalent methods like GRPO often fail when task difficulty exceeds model capacity, leading to reward sparsity and inefficient training. Prior work attempts to mitigate this with off-policy data, but such methods often induce severe distributional mismatches that destabilize policy updates. In this work, we identify a core issue underlying these failures, which we term low training affinity, and introduce Affinity, the first quantitative metric for monitoring the compatibility between external guidance and the model's intrinsic policy. To address this, we propose HINT, an adaptive framework designed to enhance reasoning capabilities while explicitly preserving high Affinity. First, instead of revealing partial answers, HINT supplies Meta-Hints, which act as abstract cognitive scaffolding to guide the model in articulating solutions independently. Second, to ensure stability, we integrate Affinity-Aware Policy Optimization (AAPO), which dynamically modulates the learning objective based on the Affinity. Extensive experiments across diverse benchmarks demonstrate that HINT consistently outperforms strong baselines, while exhibiting superior stability and robust generalization to out-of-distribution tasks. Code is available at https://github.com/ViviqwerAsd/HINT.
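The reward-sparsity failure mentioned in the abstract can be illustrated with a minimal sketch of GRPO-style group-relative advantages. This is not the paper's code: the function name `group_relative_advantages`, the `eps` constant, and the use of population standard deviation are assumptions made for illustration. The sketch only shows that when every rollout in a group receives the same reward (for example, all fail on a problem beyond the model's capacity), the standardized advantages are all zero and the group contributes no learning signal.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each reward within its rollout group (GRPO-style sketch).

    `eps` is an assumed small constant that avoids division by zero when
    all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A prompt the model sometimes solves: mixed rewards yield nonzero advantages.
solvable = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A prompt beyond the model's capacity: every rollout fails, so every
# advantage is zero and the policy gradient from this group vanishes.
too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(solvable)
print(too_hard)
```

Under these assumptions, guidance such as hints matters precisely for the second kind of group, where unguided rollouts never produce a reward difference to learn from.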



Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. No More Stale Feedback: Co-Evolving Critics for Open-World Agent Learning

    cs.AI 2026-01 unverdicted novelty 6.0

    ECHO jointly optimizes policy and critic via co-evolution, cascaded rollouts, and saturation-aware shaping to deliver non-stale feedback and higher success in open-world LLM agent RL.

  2. LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

    cs.CL 2026-05 unverdicted novelty 5.0

    LANG combines language-adaptive hint guidance, progressive decay, and difficulty-tailored learning horizons in RL to boost non-English reasoning performance while preserving language consistency.

  3. N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

    cs.LG 2026-06 unverdicted novelty 3.0

    N-GRPO enhances GRPO via Semantic Neighbor Mixing of token embeddings to improve diversity and consistency in LLM math reasoning rollouts.