Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents
Pith reviewed 2026-07-02 13:09 UTC · model grok-4.3
The pith
A fine-tuned 2B model detects span-level hallucinations over code, tool outputs, and documents at 0.689 F1.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a unified benchmark for span-level hallucination detection across code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is constructed by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split. Their fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, substantially outperforming LettuceDetect-large at 0.17 and the strongest zero-shot LLM judges at most 0.22, while remaining competitive on established natural-language benchmarks.
What carries the argument
The unified benchmark built by injecting localized hallucinations into grounded correct answers with exact character labels.
If this is right
- Span-level detectors can now be trained and evaluated on structured inputs such as code and tool output in addition to documents.
- A 2B-parameter model fine-tuned on the benchmark outperforms both prior detectors and zero-shot LLM judges on the new data sources.
- The same fine-tuned model maintains competitive performance on existing natural-language hallucination benchmarks.
- The injection method with character-level labels enables precise span evaluation across multiple input types.
Where Pith is reading between the lines
- Developers building code agents could integrate such a detector to flag unsupported spans before execution.
- The benchmark construction approach might be adapted to other structured domains such as database queries or API responses.
- Fine-tuning on synthetic localized errors may reduce reliance on large zero-shot judges for structured hallucination tasks.
Load-bearing premise
Starting from grounded correct answers and injecting localized hallucinations with exact labels produces test cases representative of real hallucinations in grounded generation systems that use code and tool outputs.
What would settle it
Direct comparison of the injected hallucinations against actual errors made by deployed code agents on real tasks would show whether the benchmark distributions match real usage.
Figures
read the original abstract
Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool output, markdown documents, tables, and repository metadata. We introduce a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. Our fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, where it substantially outperforms LettuceDetect-large (0.17) and the strongest zero-shot LLM judges we evaluated (at most 0.22). The same model remains competitive on established natural-language benchmarks, with 81.8 RAGTruth example-F1 and 0.724 English PsiloQA IoU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and natural-language RAG datasets. The benchmark is constructed by starting from grounded correct answers and injecting localized hallucinations with exact character-span labels; the code split receives an evidence-based review. A fine-tuned Qwen3.5-2B detector achieves 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, substantially outperforming LettuceDetect-large (0.17) and the strongest zero-shot LLM judges (≤0.22). The same model remains competitive on established NL benchmarks (81.8 RAGTruth example-F1, 0.724 English PsiloQA IoU).
Significance. If the benchmark construction yields representative test cases, the work supplies a practical detector and evaluation resource for grounded generation systems that consume non-textual inputs. The reported gains on the code-agent source and the use of exact span labels for fine-grained evaluation are concrete strengths. The competitive retention of performance on existing NL benchmarks further supports the unified approach.
major comments (1)
- [Abstract] Abstract / benchmark construction: the headline claim of superiority on the code-agent source (0.60 span-F1) rests on test cases generated by localized character-level injection. Real code-agent hallucinations frequently involve non-local phenomena (incorrect API semantics, control-flow errors, type mismatches across calls). The evidence-based review validates the injected labels but does not establish that the synthetic distribution matches the error distribution of actual grounded generators; this directly affects the interpretation of outperformance over LettuceDetect and zero-shot judges.
minor comments (1)
- [Abstract] The abstract refers to 'existing natural-language RAG datasets' without naming the specific corpora or describing integration details; this should be clarified in the methods section for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback on our benchmark construction. We address the major comment point-by-point below and outline planned revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract / benchmark construction: the headline claim of superiority on the code-agent source (0.60 span-F1) rests on test cases generated by localized character-level injection. Real code-agent hallucinations frequently involve non-local phenomena (incorrect API semantics, control-flow errors, type mismatches across calls). The evidence-based review validates the injected labels but does not establish that the synthetic distribution matches the error distribution of actual grounded generators; this directly affects the interpretation of outperformance over LettuceDetect and zero-shot judges.
Authors: We agree that the benchmark is constructed via localized character-level injection into grounded answers (Section 3), which enables precise span annotations unavailable in most natural hallucination corpora. The evidence-based review on the code split confirms label validity but, as the referee notes, does not prove distributional equivalence to real generator errors. We do not claim such equivalence; the benchmark is explicitly positioned as a controlled testbed for span-level detection of localized hallucinations across modalities. The reported gains (including 0.60 span-F1 on the code-agent source) therefore demonstrate relative effectiveness on this synthetic distribution rather than a universal claim about real-world error coverage. The competitive retention of performance on RAGTruth and PsiloQA provides supporting evidence of broader utility. In revision we will (1) add an explicit limitations subsection clarifying the synthetic/localized scope and (2) temper abstract language to avoid implying direct equivalence to naturalistic distributions. revision: partial
Circularity Check
No significant circularity
full rationale
The paper constructs an explicit synthetic benchmark by injecting localized hallucinations into grounded answers and reports measured span-F1 on the resulting test splits. No equations, parameter fits, or derivations are present that reduce the headline performance numbers to tautological definitions or self-citation chains. Benchmark construction and evaluation follow standard empirical practice with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Localized hallucinations can be injected into grounded correct answers with exact character labels to create realistic test cases for span-level detection.
Reference graph
Works this paper leans on
-
[1]
LettuceDetect: A Hallucination Detection Framework for
Kov. LettuceDetect: A Hallucination Detection Framework for. 2025 , eprint =
2025
-
[2]
Retrieval-augmented generation for knowledge-intensive NLP tasks , year =
Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K\". Retrieval-augmented generation for knowledge-intensive NLP tasks , year =. Proceedings of the 34th International Conference on Neural Information Processing Systems , articleno =
-
[3]
RAGT ruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models
Niu, Cheng and Wu, Yuanhao and Zhu, Juno and Xu, Siliang and Shum, KaShun and Zhong, Randy and Song, Juntong and Zhang, Tong. RAGT ruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v...
-
[4]
2024 , eprint=
Fine-grained Hallucination Detection and Editing for Language Models , author=. 2024 , eprint=
2024
-
[5]
Manakul, Potsawee and Liusie, Adian and Gales, Mark. S elf C heck GPT : Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.557
-
[6]
H alu E val: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
Li, Junyi and Cheng, Xiaoxue and Zhao, Xin and Nie, Jian-Yun and Wen, Ji-Rong. H alu E val: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.397
-
[7]
L una: A Lightweight Evaluation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost
Belyi, Masha and Friel, Robert and Shao, Shuai and Sanyal, Atindriyo. L una: A Lightweight Evaluation Model to Catch Language Model Hallucinations with High Accuracy and Low Cost. Proceedings of the 31st International Conference on Computational Linguistics: Industry Track. 2025
2025
-
[8]
Tian, Yuchen and Yan, Weixiang and Yang, Qian and Zhao, Xuandong and Chen, Qian and Wang, Wen and Luo, Ziyang and Ma, Lei and Song, Dawn , title =. Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in...
-
[9]
2025 , eprint=
CodeMirage: Hallucinations in Code Generated by Large Language Models , author=. 2025 , eprint=
2025
-
[10]
2024 , eprint=
Collu-Bench: A Benchmark for Predicting Language Model Hallucinations in Code , author=. 2024 , eprint=
2024
-
[11]
AgentHallu: Benchmarking Automated Hallucination Attribution of
Liu, Xuannan and Yang, Xiao and Li, Zekun and Li, Peipei and He, Ran , year =. AgentHallu: Benchmarking Automated Hallucination Attribution of. 2601.06818 , archivePrefix =
-
[12]
2024 , eprint=
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? , author=. 2024 , eprint=
2024
-
[13]
2026 , eprint =
Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents , author =. 2026 , eprint =
2026
-
[14]
ACL-Verbatim: hallucination-free question answering for research
Recski, Gabor and Toth, Szilveszter and Verdha, Nadia and Boros, Istvan and Kovacs, Adam , year =. 2605.21102 , archivePrefix =
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
2026 , url =
Open Wikipedia (Markdown) , author =. 2026 , url =
2026
-
[16]
2026 , howpublished =
Gemma 4 31B IT Model Card , author =. 2026 , howpublished =
2026
-
[17]
Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =
Qwen Team , month =. Qwen3.5: Accelerating Productivity with Native Multimodal Agents , url =
-
[18]
2025 , eprint=
mmBERT: A Modern Multilingual Encoder with Annealed Language Learning , author=. 2025 , eprint=
2025
-
[19]
doi:10.57967/hf/3240 , publisher =
Miaoran Li and Rogger Luo and Ofer Mendelevitch , title =. doi:10.57967/hf/3240 , publisher =
-
[20]
2024 , eprint=
Lynx: An Open Source Hallucination Evaluation Model , author=. 2024 , eprint=
2024
-
[21]
and Hind, Michael and Geyer, Werner and Rawat, Ambrish and Varshney, Kush R
Padhi, Inkit and Nagireddy, Manish and Cornacchia, Giandomenico and Chaudhury, Subhajit and Pedapati, Tejaswini and Dognin, Pierre and Murugesan, Keerthiram and Miehling, Erik and Santill \'a n Cooper, Mart \'i n and Fraser, Kieran and Zizzo, Giulio and Hameed, Muhammad Zaid and Purcell, Mark and Desmond, Michael and Pan, Qian and Vejsbjerg, Inge and Daly...
-
[22]
2026 , eprint=
Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning , author=. 2026 , eprint=
2026
-
[23]
2025 , eprint=
gpt-oss-120b & gpt-oss-20b Model Card , author=. 2025 , eprint=
2025
-
[24]
arXiv preprint arXiv:2511.23404 , year =
LFM2 Technical Report , author =. arXiv preprint arXiv:2511.23404 , year =
-
[25]
Liquid AI Blog , year =
Liquid AI , title =. Liquid AI Blog , year =
-
[26]
RAG - HAT : A Hallucination-Aware Tuning Pipeline for LLM in Retrieval-Augmented Generation
Song, Juntong and Wang, Xingguang and Zhu, Juno and Wu, Yuanhao and Cheng, Xuxin and Zhong, Randy and Niu, Cheng. RAG - HAT : A Hallucination-Aware Tuning Pipeline for LLM in Retrieval-Augmented Generation. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2024. doi:10.18653/v1/2024.emnlp-industry.113
-
[27]
2025 , eprint=
Learning to Reason for Hallucination Span Detection , author=. 2025 , eprint=
2025
-
[28]
M ini C heck: Efficient Fact-Checking of LLM s on Grounding Documents
Tang, Liyan and Laban, Philippe and Durrett, Greg. M ini C heck: Efficient Fact-Checking of LLM s on Grounding Documents. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.499
-
[29]
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with P silo QA
Rykov, Elisei and Petrushina, Kseniia and Savkin, Maksim and Olisov, Valerii and Vazhentsev, Artem and Titova, Kseniia and Panchenko, Alexander and Konovalov, Vasily and Belikova, Julia. When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with P silo QA. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. do...
-
[30]
S em E val-2025 Task 3: Mu- SHROOM , the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
Vazquez, Raul and Mickus, Timothee and Zosa, Elaine and Vahtola, Teemu and Tiedemann, J. S em E val-2025 Task 3: Mu- SHROOM , the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025). 2025
2025
-
[31]
2026 , eprint=
Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks , author=. 2026 , eprint=
2026
-
[32]
2026 , eprint=
Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems , author=. 2026 , eprint=
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.