BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Chang Luo; Ning Li; Weinan Zhang; Weiwen Liu; Wenbo Fei; Yan Xu; Yasheng Wang; Yifan Niu; Yong Yu; Zixuan Guo

arxiv: 2606.15893 · v2 · pith:UWAWVNMFnew · submitted 2026-06-14 · 💻 cs.CL

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Ning Li , Zixuan Guo , Yan Xu , Wenbo Fei , Yifan Niu , Chang Luo , Yasheng Wang , Weiwen Liu

show 2 more authors

Yong Yu Weinan Zhang

This is my paper

Pith reviewed 2026-06-27 04:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords hallucination mitigationtoken-level policy optimizationreinforcement learninglarge language modelsclaim verificationfaithfulnessRAGcredit assignment

0 comments

The pith

BALTO uses balanced token-level credit assignment to reduce hallucinations in LLMs while preserving response informativeness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate hallucinations when responses must stay faithful to provided evidence. Response-level reinforcement learning rewards create a granularity mismatch because one local error can penalize supported content. Claim-level verification offers finer signals but still suffers from unbalanced credit assignment that favors length or verbosity. BALTO extracts checkable claims, verifies them against the reference, projects the labels to tokens, and applies a balanced credit mechanism that moves probability mass from unsupported tokens toward faithful ones. Experiments across six model-benchmark combinations show BALTO reaches the highest faithfulness scores and a stronger faithfulness-informativeness trade-off than prior post-training methods.

Core claim

BALTO extracts factual claims from model outputs, verifies each claim against the reference context, projects the resulting judgments to token-level labels, and applies a balanced token-level credit assignment that redistributes probability mass from unsupported content to faithful content; theoretical analysis establishes that this design improves training stability and optimization efficiency over response-level rewards for hallucination mitigation.

What carries the argument

The balanced token-level credit assignment mechanism that redistributes probability mass from unsupported content toward faithful content instead of suppressing entire responses.

If this is right

BALTO attains the highest faithfulness across all six model-benchmark settings tested.
It consistently exceeds existing post-training baselines on Q-Score.
It exhibits a stronger faithfulness-informativeness trade-off than prior methods.
Training stability and optimization efficiency improve for hallucination mitigation tasks.
Response-level rewards are shown to suffer from granularity mismatch that token-level balancing avoids.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same projection and balancing steps could be tested on multi-turn or open-ended generation tasks where evidence is only partially available.
If the token labels remain unbiased across domains, the framework might reduce reliance on human preference data for general alignment objectives.
Applying BALTO to models that already use retrieval-augmented generation could isolate whether the gains come mainly from the credit mechanism or from the claim extraction step.
Measuring the variance of token-level rewards before and after balancing would provide a direct check on the claimed stability improvement.

Load-bearing premise

The projection from claim-level verification judgments to token-level labels can be performed without introducing systematic bias or requiring extra fitted parameters that depend on the target faithfulness metric.

What would settle it

A new model-benchmark pair in which BALTO does not achieve the highest faithfulness score or underperforms existing baselines on Q-Score.

Figures

Figures reproduced from arXiv: 2606.15893 by Chang Luo, Ning Li, Weinan Zhang, Weiwen Liu, Wenbo Fei, Yan Xu, Yasheng Wang, Yifan Niu, Yong Yu, Zixuan Guo.

**Figure 2.** Figure 2: Overview of the BALTO framework is demonstrated. Initially, the reward model localize [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Gradient norm distribution on RAGTruth. 6 8 10 12 14 16 Model Parameter Update (%) 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Faith. BALTO FSPO GRPO_B GRPO_D [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

**Figure 5.** Figure 5: Faithfulness improvement during training across different RL methods on three datasets. [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Hallucinations remain a major obstacle to deploying large language models (LLMs) in knowledge-intensive settings, where generated responses must be faithfully grounded in provided evidence. Reinforcement learning (RL) is a promising direction for hallucination mitigation, but response-level faithfulness rewards suffer from a granularity mismatch: localized hallucinations can cause supported content to receive spurious penalties. Although recent work introduces fine-grained feedback such as claim-level verification and token-level rewards, unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. We propose BALTO, a Balanced Token-level Policy Optimization framework for hallucination mitigation. BALTO extracts checkable factual claims, verifies them against the reference context, and projects claim-level judgments to token-level labels. A balanced token-level credit assignment mechanism is introduced into the framework. This design redistributes probability mass from unsupported content toward faithful content, rather than suppressing the entire response. We systematically analyze the limitations of response-level rewards from a theoretical standpoint, and prove BALTO's advantages in training stability and optimization efficiency for hallucination mitigation. Experiments on ConFiQA, RAGTruth, and FinLLM-Eval show that BALTO achieves the highest faithfulness across all six model--benchmark settings and consistently outperforms existing post-training baselines in Q-Score, demonstrating a stronger faithfulness--informativeness trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BALTO's token-level balancing and claim projection address a real granularity issue in RL for hallucinations, but the projection step may carry unisolated bias.

read the letter

The main thing to know is that BALTO projects claim-level verification labels down to tokens and adds a balanced credit assignment step inside the policy update, aiming to avoid the broad penalties that response-level rewards create when only part of an answer hallucinates.

What is new is the explicit balancing mechanism that shifts probability toward supported tokens rather than just down-weighting the whole sequence, plus the theoretical argument that this improves stability and efficiency compared with standard response-level RL. The experiments run the method on three datasets across six model-benchmark pairs and report the highest faithfulness scores plus a better faithfulness-informativeness trade-off than the listed baselines.

The soft spot is the projection step itself. The paper does not show that mapping claim judgments to individual tokens avoids length or position correlations, nor does it test whether the projection rule was tuned in a way that favors the final Q-Score metric. If that upstream labeling already correlates with the evaluation, some of the reported gains could be artifacts rather than evidence for the balancing idea. The stability proof assumes clean token labels, so it does not cover this part.

The work is aimed at people building post-training pipelines for knowledge-grounded generation. A reader who already works with claim extraction or token-level rewards will find the experiments and the attempt at theory useful to examine.

It deserves peer review because the core mismatch it targets is real and the empirical results are competitive, even though the projection bias question needs direct checks in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes BALTO, a token-level RL framework for mitigating hallucinations in LLMs. It extracts verifiable claims from responses, verifies them against reference context, projects the judgments to token-level labels, and applies a balanced credit-assignment mechanism that redistributes probability mass from unsupported to supported tokens. The authors provide a theoretical analysis of response-level reward limitations and prove BALTO's advantages in training stability and optimization efficiency; experiments on ConFiQA, RAGTruth, and FinLLM-Eval across six model-benchmark pairs report highest faithfulness and improved Q-Score over post-training baselines.

Significance. If the central claims hold, BALTO would represent a meaningful advance in fine-grained RL for hallucination mitigation by addressing granularity mismatch without suppressing entire responses. The theoretical stability analysis and consistent outperformance on faithfulness-informativeness trade-off would be notable strengths for knowledge-intensive applications.

major comments (2)

[Abstract / §3] Abstract and §3 (method): the projection from claim-level verification judgments to token-level labels is load-bearing for both the theoretical guarantees and the reported experimental gains, yet the manuscript provides no analysis showing this projection is free of systematic bias (length, position, or verifier-dependent correlations) or additional fitted parameters whose tuning depends on the downstream Q-Score metric. The stability proof assumes clean token labels and therefore does not address this upstream step.
[§4] §4 (experiments): the claim that BALTO achieves the highest faithfulness across all six settings and a stronger trade-off rests on the projection step; without ablations isolating the projection rule from the balanced credit-assignment mechanism, it is unclear whether the reported superiority is attributable to the proposed method or to the label-generation procedure itself.

minor comments (2)

[§3] Notation for the balanced credit-assignment term should be defined explicitly with an equation number rather than described only in prose.
[§4] Dataset statistics (number of claims per response, verification accuracy of the claim extractor) are needed to assess the reliability of the token-label pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each major comment below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract / §3] Abstract and §3 (method): the projection from claim-level verification judgments to token-level labels is load-bearing for both the theoretical guarantees and the reported experimental gains, yet the manuscript provides no analysis showing this projection is free of systematic bias (length, position, or verifier-dependent correlations) or additional fitted parameters whose tuning depends on the downstream Q-Score metric. The stability proof assumes clean token labels and therefore does not address this upstream step.

Authors: We appreciate the referee pointing out the importance of validating the projection step. The projection is a direct, parameter-free mapping: each token receives the label of the claim it belongs to, or neutral if not part of any claim. No additional parameters are fitted, and it does not depend on the Q-Score. However, we agree that explicit checks for systematic biases (e.g., longer claims affecting more tokens, position biases) are absent from the current version. We will add an analysis section with empirical statistics on label distributions across the datasets and correlation checks with length and position. For the stability proof, it indeed assumes the token labels as input and analyzes the balanced credit assignment; we will update the text to explicitly state this scope and note that the projection is a preprocessing step. revision: yes
Referee: [§4] §4 (experiments): the claim that BALTO achieves the highest faithfulness across all six settings and a stronger trade-off rests on the projection step; without ablations isolating the projection rule from the balanced credit-assignment mechanism, it is unclear whether the reported superiority is attributable to the proposed method or to the label-generation procedure itself.

Authors: We acknowledge that isolating the contribution of the balanced credit-assignment from the projection would provide clearer attribution. The projection is a necessary component to enable token-level rewards, but the key innovation is the balanced redistribution mechanism. To address this, we will include additional ablation experiments in the revised manuscript, such as comparing the balanced mechanism against a standard token-level RL using the same projection, and variants with different projection rules if applicable. This will help demonstrate that the gains come from the proposed balanced assignment. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; projection and proof presented as independent steps

full rationale

The paper introduces a claim-to-token projection and balanced credit assignment as novel mechanisms, followed by a theoretical analysis of response-level reward limitations and a stability proof for BALTO. No equations, fitted parameters, or self-citations are visible that reduce the claimed faithfulness gains or stability advantages back to the input labels or prior author work by construction. The central claims rest on the described pipeline and external benchmarks rather than self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted. The method implicitly assumes accurate claim extraction and verification as a prerequisite step.

pith-pipeline@v0.9.1-grok · 5789 in / 1043 out tokens · 42147 ms · 2026-06-27T04:03:49.308251+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 11 linked inside Pith

[1]

Large language models hallucination: A comprehensive survey.Computer Science Review, 61:100970, 2026

Aisha Alansari and Hamzah Luqman. Large language models hallucination: A comprehensive survey.Computer Science Review, 61:100970, 2026

2026
[2]

Reducing hallucination in structured outputs via retrieval-augmented generation.arXiv preprint arXiv:2404.08189, 2024

Patrice Béchard and Orlando Marquez Ayala. Reducing hallucination in structured outputs via retrieval-augmented generation.arXiv preprint arXiv:2404.08189, 2024

arXiv 2024
[3]

Context-dpo: Aligning language models for context-faithfulness

Baolong Bi, Shaohan Huang, Yiwei Wang, Tianchi Yang, Zihan Zhang, Haizhen Huang, Lingrui Mei, Junfeng Fang, Zehao Li, Furu Wei, et al. Context-dpo: Aligning language models for context-faithfulness. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10280–10300, 2025

2025
[4]

Dense reward for free in reinforcement learning from human feedback.arXiv preprint arXiv:2402.00782, 2024

Alex J Chan, Hao Sun, Samuel Holt, and Mihaela Van Der Schaar. Dense reward for free in reinforcement learning from human feedback.arXiv preprint arXiv:2402.00782, 2024

arXiv 2024
[5]

Discriminative policy optimization for token-level reward models.arXiv preprint arXiv:2505.23363, 2025

Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, and Ting Yao. Discriminative policy optimization for token-level reward models.arXiv preprint arXiv:2505.23363, 2025

arXiv 2025
[6]

Research: Learning to reason with search for llms via reinforcement learning

M Chen, L Sun, T Li, H Sun, Y Zhou, C Zhu, H Wang, JZ Pan, W Zhang, H Chen, et al. Research: Learning to reason with search for llms via reinforcement learning. arxiv 2025.arXiv preprint arXiv:2503.19470

Pith/arXiv arXiv 2025
[7]

Learning to reason for factuality.arXiv preprint arXiv:2508.05618, 2025

Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas O˘guz, Rulin Shao, Gargi Ghosh, Jason Weston, and Wen-tau Yih. Learning to reason for factuality.arXiv preprint arXiv:2508.05618, 2025

arXiv 2025
[8]

Stop summation: Min-form credit assignment is all process reward model needs for reasoning.arXiv preprint arXiv:2504.15275, 2025

Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Yisheng Lv, and Fei-Yue Wang. Stop summation: Min-form credit assignment is all process reward model needs for reasoning.arXiv preprint arXiv:2504.15275, 2025

arXiv 2025
[9]

Decoding by contrasting layers improves factuality in large language models (dola).arXiv Preprint, pages 1–8, 2024

YS Chuang et al. Decoding by contrasting layers improves factuality in large language models (dola).arXiv Preprint, pages 1–8, 2024

2024
[10]

Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

2024
[11]

Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models

Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. arXiv e-prints, pages arXiv–2402, 2024

2024
[12]

Baichuan-m3: Modeling clinical inquiry for reliable medical decision-making.arXiv preprint arXiv:2602.06570, 2026

Chengfeng Dou, Fan Yang, Fei Li, Jiyuan Jia, Qiang Ju, Shuai Wang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Hongda Zhang, et al. Baichuan-m3: Modeling clinical inquiry for reliable medical decision-making.arXiv preprint arXiv:2602.06570, 2026

arXiv 2026
[13]

Good learners think their thinking: Generative prm makes large reasoning model more efficient math learner.arXiv preprint arXiv:2507.23317, 2025

Tao He, Rongchuan Mu, Lizi Liao, Yixin Cao, Ming Liu, and Bing Qin. Good learners think their thinking: Generative prm makes large reasoning model more efficient math learner.arXiv preprint arXiv:2507.23317, 2025

arXiv 2025
[14]

Mitigating large language model hallucination with faithful finetuning.arXiv preprint arXiv:2406.11267, 2024

Minda Hu, Bowei He, Yufei Wang, Liangyou Li, Chen Ma, and Irwin King. Mitigating large language model hallucination with faithful finetuning.arXiv preprint arXiv:2406.11267, 2024. 10

arXiv 2024
[15]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

2023
[16]

Ai-generated news articles based on large language models

Kai Jiang, Qilai Zhang, Dongsheng Guo, Dengrong Huang, Sijia Zhang, Zizhong Wei, Fanggang Ning, and Rui Li. Ai-generated news articles based on large language models. InProceedings of the 2023 International Conference on Artificial Intelligence, Systems and Network Security, pages 82–87, 2023

2023
[17]

Deficiency of large language models in finance: An empirical examination of hallucination.arXiv preprint arXiv:2311.15548, 2023

Haoqiang Kang and Xiao-Yang Liu. Deficiency of large language models in finance: An empirical examination of hallucination.arXiv preprint arXiv:2311.15548, 2023

arXiv 2023
[18]

Medical hallucinations in foundation models and their impact on healthcare.arXiv preprint arXiv:2503.05777, 2025

Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, et al. Medical hallucinations in foundation models and their impact on healthcare.arXiv preprint arXiv:2503.05777, 2025

arXiv 2025
[19]

Know the unknown: An uncertainty-sensitive method for llm instruction tuning

Jiaqi Li, Yixuan Tang, and Yi Yang. Know the unknown: An uncertainty-sensitive method for llm instruction tuning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2972–2989, 2025

2025
[20]

Knowledge-level consistency reinforcement learning: Dual-fact alignment for long-form factuality, 2026

Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, and Haifeng Wang. Knowledge-level consistency reinforcement learning: Dual-fact alignment for long-form factuality, 2026. URLhttps://openreview.net/forum?id=Q04RwdeN9z

2026
[21]

Reasoning models hallucinate more: Factuality-aware reinforcement learning for large reasoning models

Junyi Li and Hwee Tou Ng. Reasoning models hallucinate more: Factuality-aware reinforcement learning for large reasoning models. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2026. URLhttps://openreview.net/forum?id=Igq7Dyc3OL

2026
[22]

finllm-eval: Evaluation methods for the logical, factual, and data accuracy of large models in the financial domain.https://www.github.com/Tencent/finLLM-Eval, 2025

Lichang Liang, Shaohui Wu, Yunlong Zhang, Yeyang Tang, and Fanyang Lu. finllm-eval: Evaluation methods for the logical, factual, and data accuracy of large models in the financial domain.https://www.github.com/Tencent/finLLM-Eval, 2025

2025
[23]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

2023
[24]

Deepseek-v3

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

Pith/arXiv arXiv 2025
[25]

Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

2023
[26]

Reinforcement learning finetunes small subnetworks in large language models.arXiv preprint arXiv:2505.11711, 2025

Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, and Hao Peng. Reinforcement learning finetunes small subnetworks in large language models.arXiv preprint arXiv:2505.11711, 2025

arXiv 2025
[27]

Entity-level factual consistency of abstractive text summarization

Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, and Bing Xiang. Entity-level factual consistency of abstractive text summarization. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2727–2733, 2021

2021
[28]

Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, KaShun Shum, Randy Zhong, Juntong Song, and Tong Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878, 2024

2024
[29]

Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829, 2021. 11

2021
[30]

Knowrl: Exploring knowledgeable reinforcement learning for factuality.arXiv preprint arXiv:2506.19807, 2025

Baochang Ren, Shuofei Qiao, Da Zheng, Huajun Chen, and Ningyu Zhang. Knowrl: Exploring knowledgeable reinforcement learning for factuality.arXiv preprint arXiv:2506.19807, 2025

Pith/arXiv arXiv 2025
[31]

A systematic survey of prompt engineering in large language models: Techniques and applications.arXiv preprint arXiv:2402.07927, 1, 2024

Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications.arXiv preprint arXiv:2402.07927, 1, 2024

Pith/arXiv arXiv 2024
[32]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024
[33]

Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Pith/arXiv arXiv 2024
[34]

Fspo: Few-shot optimization of synthetic preferences personalizes to real users.arXiv preprint arXiv:2502.19312, 2025

Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, and Chelsea Finn. Fspo: Few-shot optimization of synthetic preferences personalizes to real users.arXiv preprint arXiv:2502.19312, 2025

Pith/arXiv arXiv 2025
[35]

R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

Pith/arXiv arXiv 2025
[36]

Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, and Jiajun Zhang. Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

arXiv 2025
[37]

Mechanistic detection and mitigation of hallucination in large reasoning models

Zhongxiang Sun, Qipeng Wang, Haoyu Wang, Xiao Zhang, and Jun Xu. Mechanistic detection and mitigation of hallucination in large reasoning models. InThe Fourteenth International Conference on Learning Representations
[38]

Benchmarking llm faithfulness in rag with evolving leaderboards

Manveer Singh Tamber, Forrest Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, and Jimmy Lin. Benchmarking llm faithfulness in rag with evolving leaderboards. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 799–811, 2025

2025
[39]

Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy.arXiv preprint arXiv:2508.04349, 2025

Hongze Tan, Zihan Wang, Jianfei Pan, Jinghao Lin, Hao Wang, Yifan Wu, Tao Chen, Zhihang Zheng, Zhihao Tang, and Haihua Yang. Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy.arXiv preprint arXiv:2508.04349, 2025

arXiv 2025
[40]

Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

Pith/arXiv arXiv 2022
[41]

On-policy self-alignment with fine-grained knowledge feedback for hallucination mitigation

Xueru Wen, Jie Lou, Xinyu Lu, Yuqiu Ji, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, et al. On-policy self-alignment with fine-grained knowledge feedback for hallucination mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 5215–5231, 2025

2025
[42]

Capo: Towards enhancing llm reasoning through generative credit assignment.arXiv preprint arXiv:2508.02298, 2025

Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, and Xiao Zhang. Capo: Towards enhancing llm reasoning through generative credit assignment.arXiv preprint arXiv:2508.02298, 2025

arXiv 2025
[43]

Toward reliable scientific hypothesis generation: Evaluat- ing truthfulness and hallucination in large language models.arXiv preprint arXiv:2505.14599, 2025

Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, and Aidong Zhang. Toward reliable scientific hypothesis generation: Evaluat- ing truthfulness and hallucination in large language models.arXiv preprint arXiv:2505.14599, 2025

arXiv 2025
[44]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[45]

Mitigating hallucination in financial retrieval-augmented generation via fine-grained knowledge verification.arXiv preprint arXiv:2602.05723, 2026

Taoye Yin, Haoyuan Hu, Yaxin Fan, Xinhao Chen, Xinya Wu, Kai Deng, Kezun Zhang, and Feng Wang. Mitigating hallucination in financial retrieval-augmented generation via fine-grained knowledge verification.arXiv preprint arXiv:2602.05723, 2026. 12

arXiv 2026
[46]

Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback

Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Nam, Daejin Jo, Kyoung- Woon On, Mark Hasegawa-Johnson, Sungwoong Kim, and Chang Yoo. Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: ACL 2024, pages 14969–14981, 2024

2024
[47]

Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Pith/arXiv arXiv 2025
[48]

Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges

Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13643–13658, 2024

2024
[49]

Reward- sql: Boosting text-to-sql via stepwise reasoning and process-supervised rewards.arXiv preprint arXiv:2505.04671, 2025

Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, and Guoliang Li. Reward- sql: Boosting text-to-sql via stepwise reasoning and process-supervised rewards.arXiv preprint arXiv:2505.04671, 2025

arXiv 2025
[50]

Enhancing zero-shot chain-of-thought reasoning in large language models through logic

Xufeng Zhao, Mengdi Li, Wenhao Lu, Cornelius Weber, Jae-Hee Lee, Kun Chu, and Stefan Wermter. Enhancing zero-shot chain-of-thought reasoning in large language models through logic. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 6144–6166, 2024

2024
[51]

TX t=1 Abal t ∇θ logπ θ(yt|x, y<t) # =E y∼πθ

Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36:55006–55021, 2023. 13 A Related Work A.1 Hallucination In LLMs, hallucinations refer to syntactically fluent outputs that contain factual errors ...

2023
[52]

Response Text

**Extraction:** Extract all informational sentences from the "Response Text." - Extracted sentences typically contain specific data pointssuch as figures ( dates, times, and various other numerical values), entities (people, places, venues, etc.), and logical relationships (e.g., whether a cause-and-effect link holds true)
[53]

Reference Materials

**Localization:** Locate the corresponding paragraphs within the "Reference Materials."
[54]

**Comparison:** Compare the statements made in the response against the original wording found in the reference materials
[55]

**Verdict:** - **Correct:** The statement aligns perfectly with the content found in the reference materials. - **Incorrect:** A corresponding statement exists in the reference materials, but the response’s phrasing does not match it; *or* the statement cannot be found within the specified reference materials (i.e., it is fabricated/ hallucinated). 19
[56]

erroneous segment

**Error Localization:** For every incorrect item, further pinpoint the specific "erroneous segment" (‘error_spans‘) within the "Response Text" itself, to facilitate subsequent token-level penalties. - For each incorrect item, you must identify the minimal erroneous segment within the [Response Text] and output it to the ‘error_spans‘ list. - Each entry in...
[57]

**Precise Numerical Replacement (Primary Choice)**: Modify only the incorrect number (e.g., ‘17.96%‘ ‘12.08%‘)
[58]

**Precise Entity Replacement (Secondary Choice)**: If changing only the number results in a semantic contradiction (e.g., a subject mismatch), make minor adjustments to the local subject entity name or qualifier (e.g., ‘Bond D‘ ‘ Bond A‘)
[59]

**Local Refinement (Tertiary Choice)**: If simply replacing the number or entity leads to grammatical errors, semantic incoherence, or logical conflicts within the context, you are permitted to make minor additions, deletions, or stylistic tweaks to the specific sentenceor its immediate contextto ensure it reads smoothly
[60]

hallucination

**Full Sentence Deletion (Last Resort)**: If the erroneous information is a complete fabrication (a "hallucination") that cannot be factually corrected through simple editsor if correcting it would cause the logic of the entire paragraph to collapsedirectly delete the entire sentence containing the error . ## Step 3: Formatting Requirements - Preserve the...

[1] [1]

Large language models hallucination: A comprehensive survey.Computer Science Review, 61:100970, 2026

Aisha Alansari and Hamzah Luqman. Large language models hallucination: A comprehensive survey.Computer Science Review, 61:100970, 2026

2026

[2] [2]

Reducing hallucination in structured outputs via retrieval-augmented generation.arXiv preprint arXiv:2404.08189, 2024

Patrice Béchard and Orlando Marquez Ayala. Reducing hallucination in structured outputs via retrieval-augmented generation.arXiv preprint arXiv:2404.08189, 2024

arXiv 2024

[3] [3]

Context-dpo: Aligning language models for context-faithfulness

Baolong Bi, Shaohan Huang, Yiwei Wang, Tianchi Yang, Zihan Zhang, Haizhen Huang, Lingrui Mei, Junfeng Fang, Zehao Li, Furu Wei, et al. Context-dpo: Aligning language models for context-faithfulness. InFindings of the Association for Computational Linguistics: ACL 2025, pages 10280–10300, 2025

2025

[4] [4]

Dense reward for free in reinforcement learning from human feedback.arXiv preprint arXiv:2402.00782, 2024

Alex J Chan, Hao Sun, Samuel Holt, and Mihaela Van Der Schaar. Dense reward for free in reinforcement learning from human feedback.arXiv preprint arXiv:2402.00782, 2024

arXiv 2024

[5] [5]

Discriminative policy optimization for token-level reward models.arXiv preprint arXiv:2505.23363, 2025

Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, and Ting Yao. Discriminative policy optimization for token-level reward models.arXiv preprint arXiv:2505.23363, 2025

arXiv 2025

[6] [6]

Research: Learning to reason with search for llms via reinforcement learning

M Chen, L Sun, T Li, H Sun, Y Zhou, C Zhu, H Wang, JZ Pan, W Zhang, H Chen, et al. Research: Learning to reason with search for llms via reinforcement learning. arxiv 2025.arXiv preprint arXiv:2503.19470

Pith/arXiv arXiv 2025

[7] [7]

Learning to reason for factuality.arXiv preprint arXiv:2508.05618, 2025

Xilun Chen, Ilia Kulikov, Vincent-Pierre Berges, Barlas O˘guz, Rulin Shao, Gargi Ghosh, Jason Weston, and Wen-tau Yih. Learning to reason for factuality.arXiv preprint arXiv:2508.05618, 2025

arXiv 2025

[8] [8]

Stop summation: Min-form credit assignment is all process reward model needs for reasoning.arXiv preprint arXiv:2504.15275, 2025

Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Yisheng Lv, and Fei-Yue Wang. Stop summation: Min-form credit assignment is all process reward model needs for reasoning.arXiv preprint arXiv:2504.15275, 2025

arXiv 2025

[9] [9]

Decoding by contrasting layers improves factuality in large language models (dola).arXiv Preprint, pages 1–8, 2024

YS Chuang et al. Decoding by contrasting layers improves factuality in large language models (dola).arXiv Preprint, pages 1–8, 2024

2024

[10] [10]

Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. Large legal fictions: Profiling legal hallucinations in large language models.Journal of Legal Analysis, 16(1):64–93, 2024

2024

[11] [11]

Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models

Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. Retrieve only when it needs: Adaptive retrieval augmentation for hallucination mitigation in large language models. arXiv e-prints, pages arXiv–2402, 2024

2024

[12] [12]

Baichuan-m3: Modeling clinical inquiry for reliable medical decision-making.arXiv preprint arXiv:2602.06570, 2026

Chengfeng Dou, Fan Yang, Fei Li, Jiyuan Jia, Qiang Ju, Shuai Wang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Hongda Zhang, et al. Baichuan-m3: Modeling clinical inquiry for reliable medical decision-making.arXiv preprint arXiv:2602.06570, 2026

arXiv 2026

[13] [13]

Good learners think their thinking: Generative prm makes large reasoning model more efficient math learner.arXiv preprint arXiv:2507.23317, 2025

Tao He, Rongchuan Mu, Lizi Liao, Yixin Cao, Ming Liu, and Bing Qin. Good learners think their thinking: Generative prm makes large reasoning model more efficient math learner.arXiv preprint arXiv:2507.23317, 2025

arXiv 2025

[14] [14]

Mitigating large language model hallucination with faithful finetuning.arXiv preprint arXiv:2406.11267, 2024

Minda Hu, Bowei He, Yufei Wang, Liangyou Li, Chen Ma, and Irwin King. Mitigating large language model hallucination with faithful finetuning.arXiv preprint arXiv:2406.11267, 2024. 10

arXiv 2024

[15] [15]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023

2023

[16] [16]

Ai-generated news articles based on large language models

Kai Jiang, Qilai Zhang, Dongsheng Guo, Dengrong Huang, Sijia Zhang, Zizhong Wei, Fanggang Ning, and Rui Li. Ai-generated news articles based on large language models. InProceedings of the 2023 International Conference on Artificial Intelligence, Systems and Network Security, pages 82–87, 2023

2023

[17] [17]

Deficiency of large language models in finance: An empirical examination of hallucination.arXiv preprint arXiv:2311.15548, 2023

Haoqiang Kang and Xiao-Yang Liu. Deficiency of large language models in finance: An empirical examination of hallucination.arXiv preprint arXiv:2311.15548, 2023

arXiv 2023

[18] [18]

Medical hallucinations in foundation models and their impact on healthcare.arXiv preprint arXiv:2503.05777, 2025

Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, et al. Medical hallucinations in foundation models and their impact on healthcare.arXiv preprint arXiv:2503.05777, 2025

arXiv 2025

[19] [19]

Know the unknown: An uncertainty-sensitive method for llm instruction tuning

Jiaqi Li, Yixuan Tang, and Yi Yang. Know the unknown: An uncertainty-sensitive method for llm instruction tuning. InFindings of the Association for Computational Linguistics: ACL 2025, pages 2972–2989, 2025

2025

[20] [20]

Knowledge-level consistency reinforcement learning: Dual-fact alignment for long-form factuality, 2026

Junliang Li, Yucheng Wang, Yan Chen, Yu Ran, Ruiqing Zhang, Jing Liu, Hua Wu, and Haifeng Wang. Knowledge-level consistency reinforcement learning: Dual-fact alignment for long-form factuality, 2026. URLhttps://openreview.net/forum?id=Q04RwdeN9z

2026

[21] [21]

Reasoning models hallucinate more: Factuality-aware reinforcement learning for large reasoning models

Junyi Li and Hwee Tou Ng. Reasoning models hallucinate more: Factuality-aware reinforcement learning for large reasoning models. InThe Thirty-ninth Annual Conference on Neural Informa- tion Processing Systems, 2026. URLhttps://openreview.net/forum?id=Igq7Dyc3OL

2026

[22] [22]

finllm-eval: Evaluation methods for the logical, factual, and data accuracy of large models in the financial domain.https://www.github.com/Tencent/finLLM-Eval, 2025

Lichang Liang, Shaohui Wu, Yunlong Zhang, Yeyang Tang, and Fanyang Lu. finllm-eval: Evaluation methods for the logical, factual, and data accuracy of large models in the financial domain.https://www.github.com/Tencent/finLLM-Eval, 2025

2025

[23] [23]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe twelfth international conference on learning representations, 2023

2023

[24] [24]

Deepseek-v3

Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3. 2: Pushing the frontier of open large language models.arXiv preprint arXiv:2512.02556, 2025

Pith/arXiv arXiv 2025

[25] [25]

Factscore: Fine-grained atomic evaluation of factual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

2023

[26] [26]

Reinforcement learning finetunes small subnetworks in large language models.arXiv preprint arXiv:2505.11711, 2025

Sagnik Mukherjee, Lifan Yuan, Dilek Hakkani-Tur, and Hao Peng. Reinforcement learning finetunes small subnetworks in large language models.arXiv preprint arXiv:2505.11711, 2025

arXiv 2025

[27] [27]

Entity-level factual consistency of abstractive text summarization

Feng Nan, Ramesh Nallapati, Zhiguo Wang, Cicero Nogueira dos Santos, Henghui Zhu, Dejiao Zhang, Kathleen McKeown, and Bing Xiang. Entity-level factual consistency of abstractive text summarization. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2727–2733, 2021

2021

[28] [28]

Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, KaShun Shum, Randy Zhong, Juntong Song, and Tong Zhang. Ragtruth: A hallucination corpus for developing trustworthy retrieval- augmented language models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10862–10878, 2024

2024

[29] [29]

Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829, 2021. 11

2021

[30] [30]

Knowrl: Exploring knowledgeable reinforcement learning for factuality.arXiv preprint arXiv:2506.19807, 2025

Baochang Ren, Shuofei Qiao, Da Zheng, Huajun Chen, and Ningyu Zhang. Knowrl: Exploring knowledgeable reinforcement learning for factuality.arXiv preprint arXiv:2506.19807, 2025

Pith/arXiv arXiv 2025

[31] [31]

A systematic survey of prompt engineering in large language models: Techniques and applications.arXiv preprint arXiv:2402.07927, 1, 2024

Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. A systematic survey of prompt engineering in large language models: Techniques and applications.arXiv preprint arXiv:2402.07927, 1, 2024

Pith/arXiv arXiv 2024

[32] [32]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

Pith/arXiv arXiv 2024

[33] [33]

Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Pith/arXiv arXiv 2024

[34] [34]

Fspo: Few-shot optimization of synthetic preferences personalizes to real users.arXiv preprint arXiv:2502.19312, 2025

Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, and Chelsea Finn. Fspo: Few-shot optimization of synthetic preferences personalizes to real users.arXiv preprint arXiv:2502.19312, 2025

Pith/arXiv arXiv 2025

[35] [35]

R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, and Ji-Rong Wen. R1-searcher: Incentivizing the search capability in llms via reinforcement learning.arXiv preprint arXiv:2503.05592, 2025

Pith/arXiv arXiv 2025

[36] [36]

Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

Wei Sun, Wen Yang, Pu Jian, Qianlong Du, Fuwei Cui, Shuo Ren, and Jiajun Zhang. Ktae: A model-free algorithm to key-tokens advantage estimation in mathematical reasoning.arXiv preprint arXiv:2505.16826, 2025

arXiv 2025

[37] [37]

Mechanistic detection and mitigation of hallucination in large reasoning models

Zhongxiang Sun, Qipeng Wang, Haoyu Wang, Xiao Zhang, and Jun Xu. Mechanistic detection and mitigation of hallucination in large reasoning models. InThe Fourteenth International Conference on Learning Representations

[38] [38]

Benchmarking llm faithfulness in rag with evolving leaderboards

Manveer Singh Tamber, Forrest Bao, Chenyu Xu, Ge Luo, Suleman Kazi, Minseok Bae, Miaoran Li, Ofer Mendelevitch, Renyi Qu, and Jimmy Lin. Benchmarking llm faithfulness in rag with evolving leaderboards. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 799–811, 2025

2025

[39] [39]

Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy.arXiv preprint arXiv:2508.04349, 2025

Hongze Tan, Zihan Wang, Jianfei Pan, Jinghao Lin, Hao Wang, Yifan Wu, Tao Chen, Zhihang Zheng, Zhihao Tang, and Haihua Yang. Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy.arXiv preprint arXiv:2508.04349, 2025

arXiv 2025

[40] [40]

Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022

Pith/arXiv arXiv 2022

[41] [41]

On-policy self-alignment with fine-grained knowledge feedback for hallucination mitigation

Xueru Wen, Jie Lou, Xinyu Lu, Yuqiu Ji, Xinyan Guan, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Debing Zhang, et al. On-policy self-alignment with fine-grained knowledge feedback for hallucination mitigation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 5215–5231, 2025

2025

[42] [42]

Capo: Towards enhancing llm reasoning through generative credit assignment.arXiv preprint arXiv:2508.02298, 2025

Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, and Xiao Zhang. Capo: Towards enhancing llm reasoning through generative credit assignment.arXiv preprint arXiv:2508.02298, 2025

arXiv 2025

[43] [43]

Toward reliable scientific hypothesis generation: Evaluat- ing truthfulness and hallucination in large language models.arXiv preprint arXiv:2505.14599, 2025

Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, and Aidong Zhang. Toward reliable scientific hypothesis generation: Evaluat- ing truthfulness and hallucination in large language models.arXiv preprint arXiv:2505.14599, 2025

arXiv 2025

[44] [44]

Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[45] [45]

Mitigating hallucination in financial retrieval-augmented generation via fine-grained knowledge verification.arXiv preprint arXiv:2602.05723, 2026

Taoye Yin, Haoyuan Hu, Yaxin Fan, Xinhao Chen, Xinya Wu, Kai Deng, Kezun Zhang, and Feng Wang. Mitigating hallucination in financial retrieval-augmented generation via fine-grained knowledge verification.arXiv preprint arXiv:2602.05723, 2026. 12

arXiv 2026

[46] [46]

Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback

Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Nam, Daejin Jo, Kyoung- Woon On, Mark Hasegawa-Johnson, Sungwoong Kim, and Chang Yoo. Tlcr: Token-level continuous reward for fine-grained reinforcement learning from human feedback. InFindings of the Association for Computational Linguistics: ACL 2024, pages 14969–14981, 2024

2024

[47] [47]

Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

Pith/arXiv arXiv 2025

[48] [48]

Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges

Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13643–13658, 2024

2024

[49] [49]

Reward- sql: Boosting text-to-sql via stepwise reasoning and process-supervised rewards.arXiv preprint arXiv:2505.04671, 2025

Yuxin Zhang, Meihao Fan, Ju Fan, Mingyang Yi, Yuyu Luo, Jian Tan, and Guoliang Li. Reward- sql: Boosting text-to-sql via stepwise reasoning and process-supervised rewards.arXiv preprint arXiv:2505.04671, 2025

arXiv 2025

[50] [50]

Enhancing zero-shot chain-of-thought reasoning in large language models through logic

Xufeng Zhao, Mengdi Li, Wenhao Lu, Cornelius Weber, Jae-Hee Lee, Kun Chu, and Stefan Wermter. Enhancing zero-shot chain-of-thought reasoning in large language models through logic. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 6144–6166, 2024

2024

[51] [51]

TX t=1 Abal t ∇θ logπ θ(yt|x, y<t) # =E y∼πθ

Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment.Advances in Neural Information Processing Systems, 36:55006–55021, 2023. 13 A Related Work A.1 Hallucination In LLMs, hallucinations refer to syntactically fluent outputs that contain factual errors ...

2023

[52] [52]

Response Text

**Extraction:** Extract all informational sentences from the "Response Text." - Extracted sentences typically contain specific data pointssuch as figures ( dates, times, and various other numerical values), entities (people, places, venues, etc.), and logical relationships (e.g., whether a cause-and-effect link holds true)

[53] [53]

Reference Materials

**Localization:** Locate the corresponding paragraphs within the "Reference Materials."

[54] [54]

**Comparison:** Compare the statements made in the response against the original wording found in the reference materials

[55] [55]

**Verdict:** - **Correct:** The statement aligns perfectly with the content found in the reference materials. - **Incorrect:** A corresponding statement exists in the reference materials, but the response’s phrasing does not match it; *or* the statement cannot be found within the specified reference materials (i.e., it is fabricated/ hallucinated). 19

[56] [56]

erroneous segment

**Error Localization:** For every incorrect item, further pinpoint the specific "erroneous segment" (‘error_spans‘) within the "Response Text" itself, to facilitate subsequent token-level penalties. - For each incorrect item, you must identify the minimal erroneous segment within the [Response Text] and output it to the ‘error_spans‘ list. - Each entry in...

[57] [57]

**Precise Numerical Replacement (Primary Choice)**: Modify only the incorrect number (e.g., ‘17.96%‘ ‘12.08%‘)

[58] [58]

**Precise Entity Replacement (Secondary Choice)**: If changing only the number results in a semantic contradiction (e.g., a subject mismatch), make minor adjustments to the local subject entity name or qualifier (e.g., ‘Bond D‘ ‘ Bond A‘)

[59] [59]

**Local Refinement (Tertiary Choice)**: If simply replacing the number or entity leads to grammatical errors, semantic incoherence, or logical conflicts within the context, you are permitted to make minor additions, deletions, or stylistic tweaks to the specific sentenceor its immediate contextto ensure it reads smoothly

[60] [60]

hallucination

**Full Sentence Deletion (Last Resort)**: If the erroneous information is a complete fabrication (a "hallucination") that cannot be factually corrected through simple editsor if correcting it would cause the logic of the entire paragraph to collapsedirectly delete the entire sentence containing the error . ## Step 3: Formatting Requirements - Preserve the...