TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Anuj Kumar; Jiaqi Wang; Jingxiang Chen; Kai Sun; Mohammad Kachuee; Nicolas Scheffer; Rakesh Wanga; Rulin Shao; Teja Gollapudi; Wen-tau Yih

arxiv: 2509.25760 · v2 · pith:ZIFSRQIGnew · submitted 2025-09-30 · 💻 cs.CL · cs.AI · cs.LG


Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong


classification 💻 cs.CL cs.AI cs.LG

keywords hallucinations, truthfulness, truthrl, llms, models, correct, when, abstention



While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% $\rightarrow$ 19.4%) and improves truthfulness (e.g., 5.3% $\rightarrow$ 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from enhanced capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are.
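The ternary reward and its use inside GRPO can be illustrated with a minimal sketch. The abstract only states that the reward distinguishes correct answers, hallucinations, and abstentions; the specific values (+1, 0, -1), the function names, and the use of the population standard deviation for group normalization are assumptions made here for illustration, not details confirmed by this page.

```python
from statistics import mean, pstdev

# Assumed reward values: the abstract says only that the reward is ternary.
REWARDS = {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}


def ternary_reward(outcome: str) -> float:
    """Map a judged outcome label to a scalar reward (hypothetical helper)."""
    return REWARDS[outcome]


def group_advantages(outcomes: list[str], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: normalize each reward within its sampled group.

    `outcomes` holds the judged labels for several responses sampled for the
    same question. Each advantage is (reward - group mean) / (group std + eps).
    """
    rewards = [ternary_reward(o) for o in outcomes]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = ["correct", "abstain", "hallucination", "hallucination"]
    for label, adv in zip(group, group_advantages(group)):
        print(f"{label:>13}: {adv:+.3f}")
```

Under these assumed values, an abstention in a group that also contains hallucinations receives a higher advantage than the hallucinated responses but a lower one than a correct answer, which is the ordering the abstract describes. A group in which every response receives the same label yields zero advantage for all of them.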



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs
cs.CL 2026-05 unverdicted novelty 7.0

RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
cs.CL 2026-04 unverdicted novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing
cs.LG 2026-02 unverdicted novelty 7.0

Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.

AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

AIPO: Learning to Reason from Active Interaction
cs.CL 2026-05 unverdicted novelty 6.0

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.