ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

Shuaiyi Nie, Siyu Ding, Wenyuan Zhang, Linhao Yu, Tianmeng Yang, Yao Chen, Weichong Yin, Yu Sun, Hua Wu, Tingwen Liu

classification cs.CL

keywords reasoning, steps, attention, attnpo, length, performance, redundant, while

Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to shorten reasoning effectively and can degrade accuracy, because they treat all reasoning steps uniformly and lack fine-grained signals to distinguish redundant steps from necessary ones. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model's intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. Using the attention scores of these heads, we then employ two sub-strategies that mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.
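The abstract does not give ATTNPO's formulas, so the following is only a minimal sketch of the general idea of attention-guided step-level credit assignment, not the authors' method. It assumes each reasoning step already has a single attention score (for example, averaged over the selected heads), and that a step's penalty shrinks as its attention grows. The function names, the min-max normalization, and the `max_penalty` parameter are all hypothetical.

```python
from typing import List


def normalize(scores: List[float]) -> List[float]:
    """Min-max normalize scores to [0, 1]; constant input maps to all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def step_advantages(trajectory_advantage: float,
                    step_attention: List[float],
                    max_penalty: float = 0.5) -> List[float]:
    """Hypothetical step-level credit assignment (an assumption, not ATTNPO itself).

    Each step starts from the shared trajectory-level advantage and is
    penalized in proportion to how little attention it receives, so
    low-attention (assumed redundant) steps are discouraged while
    high-attention (assumed essential) steps keep nearly full credit.
    """
    if not step_attention:
        return []
    weights = normalize(step_attention)
    return [trajectory_advantage - max_penalty * (1.0 - w) for w in weights]


# Illustrative numbers only, not taken from the paper.
advantages = step_advantages(1.0, [0.40, 0.05, 0.30, 0.02])
```

In this toy setup the step with the highest attention keeps the full trajectory-level advantage and the step with the lowest attention receives the full penalty. A uniform trajectory-level length penalty, by contrast, would lower every step's credit by the same amount.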

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.

SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning
cs.LG 2026-04 unverdicted novelty 5.0

SHAPE improves average math reasoning accuracy by 3% while cutting token use by 30% through stage-aware hierarchical advantage and entropy-driven token redistribution.