Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

Chao Wang; Hexuan Deng; Min Zhang; Ruiyu Fang; Shuangyong Song; Shuo Nie; Xuebo Liu; Xuelong Li; Yu Li

arxiv 2602.05897 v2 pith:GHAVRVZF submitted 2026-02-05 cs.CL

Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li

classification cs.CL

keywords reasoning, step-level, faithrl, learning, models, reinforcement, rewards, faithful

As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes, while also mitigating reward hacking from step-level rewards. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at https://github.com/Easy195/FaithRL.
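The abstract names two ingredients: explicit step-level faithfulness rewards from a process reward model, and truncated resampling from faithful prefixes. The sketch below illustrates one way those two ideas could fit together; it is not the authors' implementation, which lives in the linked repository. The function names, the 0.5 threshold, the -1.0 penalty, and the assumption that the process reward model returns one score in [0, 1] per step are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def faithful_prefix_length(step_scores: Sequence[float], threshold: float = 0.5) -> int:
    """Count the leading steps whose faithfulness score is at least `threshold`.

    `step_scores` stands in for per-step outputs of a process reward model
    (an assumption; the abstract does not specify the score format).
    """
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return len(step_scores)


def step_rewards(
    step_scores: Sequence[float],
    outcome_correct: bool,
    threshold: float = 0.5,
    penalty: float = -1.0,
) -> List[float]:
    """Assign a reward to each step (hypothetical scheme).

    Steps judged faithful share the outcome reward; steps judged unfaithful
    receive `penalty` even when the final answer is correct, so a correct
    answer no longer reinforces a hallucinated step.
    """
    outcome = 1.0 if outcome_correct else 0.0
    return [outcome if score >= threshold else penalty for score in step_scores]


def truncate_for_resampling(
    steps: Sequence[str],
    step_scores: Sequence[float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the faithful prefix; a policy would resample continuations from it."""
    return list(steps[: faithful_prefix_length(step_scores, threshold)])


if __name__ == "__main__":
    steps = ["read passage", "extract date", "invent a fact", "state answer"]
    scores = [0.9, 0.8, 0.2, 0.7]  # made-up scores for illustration only
    print(step_rewards(scores, outcome_correct=True))
    print(truncate_for_resampling(steps, scores))
```

In this toy setup, the third step is penalized although the final answer is correct, and resampling would restart from the first two steps, which gives a contrast between continuations that share the same faithful prefix.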

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CLIR-Bench: Benchmarking Multimodal Question Answering over Irregular Clinical Time Series
cs.CL 2026-07 conditional novelty 6.0

CLIR-Bench shows generalist and time-series LLMs struggle to ground clinical answers in sparse irregular ICU evidence, with top accuracy near 50% and weak causal evidence use.