arxiv: 2606.15893 · v2 · pith:UWAWVNMFnew · submitted 2026-06-14 · 💻 cs.CL

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Pith reviewed 2026-06-27 04:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords hallucination mitigation · token-level policy optimization · reinforcement learning · large language models · claim verification · faithfulness · RAG · credit assignment

The pith

BALTO uses balanced token-level credit assignment to reduce hallucinations in LLMs while preserving response informativeness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate hallucinations when responses must stay faithful to provided evidence. Response-level reinforcement learning rewards create a granularity mismatch because one local error can penalize supported content. Claim-level verification offers finer signals but still suffers from unbalanced credit assignment that favors length or verbosity. BALTO extracts checkable claims, verifies them against the reference, projects the labels to tokens, and applies a balanced credit mechanism that moves probability mass from unsupported tokens toward faithful ones. Experiments across six model-benchmark combinations show BALTO reaches the highest faithfulness scores and a stronger faithfulness-informativeness trade-off than prior post-training methods.

Core claim

BALTO extracts factual claims from model outputs, verifies each claim against the reference context, projects the resulting judgments to token-level labels, and applies a balanced token-level credit assignment that redistributes probability mass from unsupported content to faithful content; theoretical analysis establishes that this design improves training stability and optimization efficiency over response-level rewards for hallucination mitigation.
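
The contrast between response-level and token-level credit can be written in standard policy-gradient form. This is a generic sketch, not the paper's own notation: the symbols R, b, and ℓ_t and the zero-sum constraint are assumptions introduced here for illustration, since this page does not reproduce the paper's definition of the balanced term.

```latex
% Generic token-weighted policy gradient (standard form)
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{T} A_t \, \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \right]

% Response-level credit: every token shares one scalar
% (assumed reward R and baseline b)
A_t^{\mathrm{resp}} = R(x, y) - b \qquad \text{for all } t

% Token-level credit: A_t depends on the projected label \ell_t \in \{+1, 0, -1\}.
% One hypothetical way to make it "balanced" is a zero-sum constraint:
\sum_{t=1}^{T} A_t^{\mathrm{bal}} = 0
```

Under the response-level form, a single unsupported span can make R(x, y) - b negative for the whole response, so every supported token in it is pushed down as well. That is the granularity mismatch the core claim refers to.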

What carries the argument

The balanced token-level credit assignment mechanism that redistributes probability mass from unsupported content toward faithful content instead of suppressing entire responses.
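
To make the idea of redistribution concrete, here is a minimal Python sketch comparing a broadcast response-level reward with a zero-sum token-level scheme. The label encoding (+1 supported, 0 neutral, -1 unsupported), the function names, and the zero-sum rule are all assumptions made for illustration. The paper's actual balancing formula is not given on this page and may differ.

```python
from typing import List

# Token labels assumed for this sketch: +1 supported, 0 neutral, -1 unsupported.
SUPPORTED, NEUTRAL, UNSUPPORTED = 1, 0, -1


def response_level_credit(labels: List[int]) -> List[float]:
    """Broadcast one scalar to all tokens: -1 if any token is unsupported, else +1."""
    reward = -1.0 if UNSUPPORTED in labels else 1.0
    return [reward] * len(labels)


def balanced_token_credit(labels: List[int]) -> List[float]:
    """Hypothetical zero-sum credit: unsupported tokens share a total of -1,
    supported tokens share a total of +1, neutral tokens receive 0."""
    n_sup = labels.count(SUPPORTED)
    n_uns = labels.count(UNSUPPORTED)
    if n_sup == 0 or n_uns == 0:
        # Nothing to redistribute between the two groups in this toy scheme.
        return [0.0] * len(labels)
    per_label = {SUPPORTED: 1.0 / n_sup, NEUTRAL: 0.0, UNSUPPORTED: -1.0 / n_uns}
    return [per_label[label] for label in labels]


labels = [1, 1, 1, 0, -1, -1, 1, 1]  # five supported, one neutral, two unsupported
print(response_level_credit(labels))  # every token is penalized
print(balanced_token_credit(labels))  # only the unsupported tokens are penalized
```

In this toy scheme the credits within a response sum to zero, so the total does not grow with response length. This is one simple way to avoid rewarding verbosity, offered only as an illustration.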

If this is right

  • BALTO attains the highest faithfulness across all six model-benchmark settings tested.
  • It consistently exceeds existing post-training baselines on Q-Score.
  • It exhibits a stronger faithfulness-informativeness trade-off than prior methods.
  • Training stability and optimization efficiency improve for hallucination mitigation tasks.
  • Response-level rewards are shown to suffer from granularity mismatch that token-level balancing avoids.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection and balancing steps could be tested on multi-turn or open-ended generation tasks where evidence is only partially available.
  • If the token labels remain unbiased across domains, the framework might reduce reliance on human preference data for general alignment objectives.
  • Applying BALTO to models that already use retrieval-augmented generation could isolate whether the gains come mainly from the credit mechanism or from the claim extraction step.
  • Measuring the variance of token-level rewards before and after balancing would provide a direct check on the claimed stability improvement.
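
The variance check suggested in the last bullet could be prototyped roughly as follows. This sketch measures the variance of per-response total credit across a batch, which is only a stand-in for the gradient-level quantity a real experiment would track. The `raw_credit` and `balanced_credit` functions are hypothetical and are not taken from the paper.

```python
from statistics import pvariance
from typing import Callable, List


def raw_credit(labels: List[int]) -> List[float]:
    """Unbalanced baseline: each token's credit is simply its label (+1, 0, -1)."""
    return [float(label) for label in labels]


def balanced_credit(labels: List[int]) -> List[float]:
    """Hypothetical zero-sum credit (an assumption, not the paper's formula)."""
    n_sup, n_uns = labels.count(1), labels.count(-1)
    if n_sup == 0 or n_uns == 0:
        return [0.0] * len(labels)
    scale = {1: 1.0 / n_sup, 0: 0.0, -1: -1.0 / n_uns}
    return [scale[label] for label in labels]


def total_credit_variance(
    batch: List[List[int]], credit_fn: Callable[[List[int]], List[float]]
) -> float:
    """Population variance of the summed credit per response."""
    totals = [sum(credit_fn(labels)) for labels in batch]
    return pvariance(totals)


# Toy batch of token-label sequences with different lengths.
batch = [
    [1, 1, 1, -1],
    [1, -1, -1, 0],
    [1, 1, 1, 1, 1, 1, -1],
]
var_raw = total_credit_variance(batch, raw_credit)
var_balanced = total_credit_variance(batch, balanced_credit)
print(f"raw: {var_raw:.3f}  balanced: {var_balanced:.3f}")
```

With raw labels the per-response total tracks length and label mix, so it varies across the batch. Under the zero-sum assumption it collapses to approximately zero. Whether the same holds for the paper's mechanism would need to be checked on real rollouts.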

Load-bearing premise

The projection from claim-level verification judgments to token-level labels can be performed without introducing systematic bias or requiring extra fitted parameters that depend on the target faithfulness metric.

What would settle it

A new model-benchmark pair in which BALTO does not achieve the highest faithfulness score or underperforms existing baselines on Q-Score.

Figures

Figures reproduced from arXiv: 2606.15893 by Chang Luo, Ning Li, Weinan Zhang, Weiwen Liu, Wenbo Fei, Yan Xu, Yasheng Wang, Yifan Niu, Yong Yu, Zixuan Guo.

Figure 1: Pilot study on FinLLM-Eval with Qwen3-8B. Hallucinations are sparse within individual …
Figure 2: Overview of the BALTO framework is demonstrated. Initially, the reward model localize …
Figure 3: Gradient norm distribution on RAGTruth. [Plot axes recovered from the same image: x-axis "Model Parameter Update (%)", y-axis "Faith."; series BALTO, FSPO, GRPO_B, GRPO_D]
Figure 5: Faithfulness improvement during training across different RL methods on three datasets.
Original abstract

Hallucinations remain a major obstacle to deploying large language models (LLMs) in knowledge-intensive settings, where generated responses must be faithfully grounded in provided evidence. Reinforcement learning (RL) is a promising direction for hallucination mitigation, but response-level faithfulness rewards suffer from a granularity mismatch: localized hallucinations can cause supported content to receive spurious penalties. Although recent work introduces fine-grained feedback such as claim-level verification and token-level rewards, unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. We propose BALTO, a Balanced Token-level Policy Optimization framework for hallucination mitigation. BALTO extracts checkable factual claims, verifies them against the reference context, and projects claim-level judgments to token-level labels. A balanced token-level credit assignment mechanism is introduced into the framework. This design redistributes probability mass from unsupported content toward faithful content, rather than suppressing the entire response. We systematically analyze the limitations of response-level rewards from a theoretical standpoint, and prove BALTO's advantages in training stability and optimization efficiency for hallucination mitigation. Experiments on ConFiQA, RAGTruth, and FinLLM-Eval show that BALTO achieves the highest faithfulness across all six model-benchmark settings and consistently outperforms existing post-training baselines in Q-Score, demonstrating a stronger faithfulness-informativeness trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes BALTO, a token-level RL framework for mitigating hallucinations in LLMs. It extracts verifiable claims from responses, verifies them against reference context, projects the judgments to token-level labels, and applies a balanced credit-assignment mechanism that redistributes probability mass from unsupported to supported tokens. The authors provide a theoretical analysis of response-level reward limitations and prove BALTO's advantages in training stability and optimization efficiency; experiments on ConFiQA, RAGTruth, and FinLLM-Eval across six model-benchmark pairs report highest faithfulness and improved Q-Score over post-training baselines.

Significance. If the central claims hold, BALTO would represent a meaningful advance in fine-grained RL for hallucination mitigation by addressing granularity mismatch without suppressing entire responses. The theoretical stability analysis and the consistent gains on the faithfulness-informativeness trade-off would be notable strengths for knowledge-intensive applications.

major comments (2)
  1. [Abstract / §3] Abstract and §3 (method): the projection from claim-level verification judgments to token-level labels is load-bearing for both the theoretical guarantees and the reported experimental gains, yet the manuscript provides no analysis showing this projection is free of systematic bias (length, position, or verifier-dependent correlations) or additional fitted parameters whose tuning depends on the downstream Q-Score metric. The stability proof assumes clean token labels and therefore does not address this upstream step.
  2. [§4] §4 (experiments): the claim that BALTO achieves the highest faithfulness across all six settings and a stronger trade-off rests on the projection step; without ablations isolating the projection rule from the balanced credit-assignment mechanism, it is unclear whether the reported superiority is attributable to the proposed method or to the label-generation procedure itself.
minor comments (2)
  1. [§3] Notation for the balanced credit-assignment term should be defined explicitly with an equation number rather than described only in prose.
  2. [§4] Dataset statistics (number of claims per response, verification accuracy of the claim extractor) are needed to assess the reliability of the token-label pipeline.
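
The position and length checks requested in the first major comment could start from something like the following. This is a rough diagnostic sketch: the label encoding (+1 supported, 0 neutral, -1 unsupported) and all function names are assumptions, and a near-zero Pearson correlation would not by itself rule out other forms of bias such as verifier-dependent ones.

```python
from math import sqrt
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def position_label_correlation(batch: List[List[int]]) -> float:
    """Correlate relative token position (0..1) with an 'unsupported' indicator."""
    positions, flags = [], []
    for labels in batch:
        last = max(len(labels) - 1, 1)
        for t, label in enumerate(labels):
            positions.append(t / last)
            flags.append(1.0 if label == -1 else 0.0)
    return pearson(positions, flags)


def length_label_correlation(batch: List[List[int]]) -> float:
    """Correlate response length with the fraction of unsupported tokens."""
    lengths = [float(len(labels)) for labels in batch]
    fractions = [labels.count(-1) / len(labels) for labels in batch]
    return pearson(lengths, fractions)


# Toy batch in which unsupported tokens cluster at the end of each response.
batch = [[1, 1, 1, -1], [1, 1, -1, -1]]
print(position_label_correlation(batch))
print(length_label_correlation(batch))
```

A strongly positive position correlation on real data would suggest that the projected labels penalize late tokens more often, which is the kind of systematic effect the referee asks the authors to rule out.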

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our paper. We address each major comment below, providing clarifications and committing to revisions where appropriate to strengthen the manuscript.

point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (method): the projection from claim-level verification judgments to token-level labels is load-bearing for both the theoretical guarantees and the reported experimental gains, yet the manuscript provides no analysis showing this projection is free of systematic bias (length, position, or verifier-dependent correlations) or additional fitted parameters whose tuning depends on the downstream Q-Score metric. The stability proof assumes clean token labels and therefore does not address this upstream step.

    Authors: We appreciate the referee pointing out the importance of validating the projection step. The projection is a direct, parameter-free mapping: each token receives the label of the claim it belongs to, or neutral if not part of any claim. No additional parameters are fitted, and it does not depend on the Q-Score. However, we agree that explicit checks for systematic biases (e.g., longer claims affecting more tokens, position biases) are absent from the current version. We will add an analysis section with empirical statistics on label distributions across the datasets and correlation checks with length and position. The stability proof indeed takes the token labels as given and analyzes only the balanced credit assignment; we will update the text to state this scope explicitly and note that the projection is a preprocessing step. revision: yes

  2. Referee: [§4] §4 (experiments): the claim that BALTO achieves the highest faithfulness across all six settings and a stronger trade-off rests on the projection step; without ablations isolating the projection rule from the balanced credit-assignment mechanism, it is unclear whether the reported superiority is attributable to the proposed method or to the label-generation procedure itself.

    Authors: We acknowledge that isolating the contribution of the balanced credit-assignment from the projection would provide clearer attribution. The projection is a necessary component to enable token-level rewards, but the key innovation is the balanced redistribution mechanism. To address this, we will include additional ablation experiments in the revised manuscript, such as comparing the balanced mechanism against a standard token-level RL baseline using the same projection, and variants with different projection rules if applicable. This will help demonstrate that the gains come from the proposed balanced assignment. revision: yes
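
The parameter-free projection described in the first response (each token takes the label of the claim it belongs to, or neutral otherwise) might look roughly like this. The `Claim` structure with character offsets, the whitespace tokenizer, and the containment rule are assumptions made for this sketch. A real pipeline would presumably use the model tokenizer's own offsets and the paper's claim spans.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Claim:
    """Hypothetical claim record: a character span plus the verifier's verdict."""
    start: int       # inclusive character offset in the response
    end: int         # exclusive character offset in the response
    supported: bool


def tokenize_with_offsets(text: str) -> List[Tuple[str, int, int]]:
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def project_labels(text: str, claims: List[Claim]) -> List[int]:
    """Give each token the label of the claim that contains it, else 0 (neutral)."""
    labels = []
    for _, tok_start, tok_end in tokenize_with_offsets(text):
        label = 0
        for claim in claims:
            if tok_start >= claim.start and tok_end <= claim.end:
                label = 1 if claim.supported else -1
                break
        labels.append(label)
    return labels


text = "Revenue rose 12% in 2023. Overall, margins fell to 5%."
first, second = "Revenue rose 12% in 2023.", "margins fell to 5%."
claims = [
    Claim(text.index(first), text.index(first) + len(first), supported=True),
    Claim(text.index(second), text.index(second) + len(second), supported=False),
]
print(project_labels(text, claims))
```

Because the rule has no fitted parameters, any bias it introduces would come from how claims are segmented and verified upstream. That upstream step is what the promised ablations would need to vary.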

Circularity Check

0 steps flagged

No circularity in derivation chain; projection and proof presented as independent steps

full rationale

The paper introduces a claim-to-token projection and balanced credit assignment as novel mechanisms, followed by a theoretical analysis of response-level reward limitations and a stability proof for BALTO. No equations, fitted parameters, or self-citations are visible that reduce the claimed faithfulness gains or stability advantages back to the input labels or prior author work by construction. The central claims rest on the described pipeline and external benchmarks rather than self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted. The method implicitly assumes accurate claim extraction and verification as a prerequisite step.


