arxiv: 2607.00895 · v1 · pith:T7PUTPAZnew · submitted 2026-07-01 · 💻 cs.CL

Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

Pith reviewed 2026-07-02 13:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords hallucination detection · span-level detection · retrieval-augmented generation · code generation · tool use · benchmark construction

The pith

A fine-tuned 2B model detects span-level hallucinations over code, tool outputs, and documents at 0.689 span-F1 on the unified test set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark that extends hallucination detection beyond natural-language documents to include source code, tool outputs, markdown, tables, and repository metadata. It generates test data by taking correct grounded answers and inserting localized hallucinations marked at the exact character level, then validates the code portion through evidence review. A Qwen3.5-2B model fine-tuned on this data reaches 0.689 span-F1 on the full test set and 0.60 on the code-agent portion. The same model stays competitive with 81.8 example-F1 on RAGTruth and 0.724 IoU on English PsiloQA.

Core claim

The authors introduce a unified benchmark for span-level hallucination detection across code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is constructed by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split. Their fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, substantially outperforming LettuceDetect-large at 0.17 and the strongest zero-shot LLM judges at most 0.22, while remaining competitive on established natural-language benchmarks.

What carries the argument

The unified benchmark built by injecting localized hallucinations into grounded correct answers with exact character labels.
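The construction can be illustrated with a minimal sketch, assuming the injector proposes replacement edits that are applied programmatically to the clean answer. The `Edit` class and `apply_edits` function are hypothetical names, not the paper's code; the point is only that applying edits in code yields exact character offsets for each injected span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical replacement edit: swap `original` for `replacement`."""
    original: str
    replacement: str


def apply_edits(clean: str, edits: list[Edit]) -> tuple[str, list[tuple[int, int]]]:
    """Apply edits to `clean`; return the corrupted text and the
    (start, end) character span of each replacement in that text."""
    # Locate each edit target in the clean answer. Requiring exactly one
    # match keeps the resulting label unambiguous (an assumption of this sketch).
    located = []
    for edit in edits:
        if clean.count(edit.original) != 1:
            raise ValueError(f"edit target must occur exactly once: {edit.original!r}")
        located.append((clean.index(edit.original), edit))
    located.sort(key=lambda pair: pair[0])

    pieces: list[str] = []
    spans: list[tuple[int, int]] = []
    cursor = 0   # position in the clean text
    out_len = 0  # length of the corrupted text built so far
    for start, edit in located:
        if start < cursor:
            raise ValueError("edits overlap")
        unchanged = clean[cursor:start]
        pieces.append(unchanged)
        out_len += len(unchanged)
        pieces.append(edit.replacement)
        spans.append((out_len, out_len + len(edit.replacement)))
        out_len += len(edit.replacement)
        cursor = start + len(edit.original)
    pieces.append(clean[cursor:])
    return "".join(pieces), spans
```

Because the spans are computed while the text is rebuilt, each label is exact by construction: slicing the corrupted text with a returned span always gives back the injected replacement.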

If this is right

  • Span-level detectors can now be trained and evaluated on structured inputs such as code and tool output in addition to documents.
  • A 2B-parameter model fine-tuned on the benchmark outperforms both prior detectors and zero-shot LLM judges on the new data sources.
  • The same fine-tuned model maintains competitive performance on existing natural-language hallucination benchmarks.
  • The injection method with character-level labels enables precise span evaluation across multiple input types.
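One way character-level labels support span evaluation is sketched below. This page does not give the paper's scoring definitions, so the character-overlap formulation, the function names, and the convention for empty inputs are all assumptions made for illustration.

```python
def to_chars(spans: list[tuple[int, int]]) -> set[int]:
    """Expand half-open (start, end) spans into a set of character indices."""
    chars: set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_prf(pred: list[tuple[int, int]], gold: list[tuple[int, int]]) -> tuple[float, float, float]:
    """Character-overlap precision, recall, and F1 (assumed formulation)."""
    p, g = to_chars(pred), to_chars(gold)
    overlap = len(p & g)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def span_iou(pred: list[tuple[int, int]], gold: list[tuple[int, int]]) -> float:
    """Character-level intersection over union.
    Returns 1.0 when both sides are empty (a convention assumed here)."""
    p, g = to_chars(pred), to_chars(gold)
    union = p | g
    return len(p & g) / len(union) if union else 1.0
```

For a prediction covering characters 0 to 10 against a gold span covering 5 to 15, this sketch gives precision, recall, and F1 of 0.5 each and an IoU of one third, which shows how the two metrics can diverge on the same partial overlap.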

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers building code agents could integrate such a detector to flag unsupported spans before execution.
  • The benchmark construction approach might be adapted to other structured domains such as database queries or API responses.
  • Fine-tuning on synthetic localized errors may reduce reliance on large zero-shot judges for structured hallucination tasks.

Load-bearing premise

Starting from grounded correct answers and injecting localized hallucinations with exact labels produces test cases representative of real hallucinations in grounded generation systems that use code and tool outputs.

What would settle it

Direct comparison of the injected hallucinations against actual errors made by deployed code agents on real tasks would show whether the benchmark distributions match real usage.

Figures

Figures reproduced from arXiv: 2607.00895 by Ádám Kovács, Bowei He, Gábor Recski, István Boros, Szilveszter Tóth, Xue Liu.

Figure 1: Edit-based injection yields exact spans. The injector returns each change as an …
Figure 2: Example benchmark instance. We use [H] ... [/H] markers for gold unsupported spans.
    ACL injections use paper-specific numerical, entity, relational, methodological, and citation-like edits detectable from the retrieved excerpts. README and Wikipedia injections use a generic markdown prompt covering numerical, temporal, entity, relational, fabricated-reference, and unsupported-claim edits. …
Figure 3: Reference grounding for code examples. Cor…
Figure 4: Detector prompt used for generative detector training and zero-shot LLM-judge evaluation. For code-agent …
Figure 5: Question and clean-answer generation prompts. README and Wikipedia examples first generate a …
Figure 6: Generic injection prompt. The model proposes structured replacement edits; the pipeline applies them …
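The [H] ... [/H] marker convention mentioned in the Figure 2 caption can be converted back into character spans with a short parser. The sketch below is illustrative only: `strip_markers` is a hypothetical helper, and it assumes markers are balanced and never nested.

```python
import re

# Non-greedy match of one marked span; DOTALL lets spans cross line breaks.
MARKER = re.compile(r"\[H\](.*?)\[/H\]", re.DOTALL)


def strip_markers(marked: str) -> tuple[str, list[tuple[int, int]]]:
    """Remove [H]...[/H] markers and return the plain text together with
    the (start, end) character span of each marked region in that text."""
    pieces: list[str] = []
    spans: list[tuple[int, int]] = []
    cursor = 0  # position in the marked text
    length = 0  # length of the plain text built so far
    for match in MARKER.finditer(marked):
        before = marked[cursor:match.start()]
        pieces.append(before)
        length += len(before)
        inner = match.group(1)
        spans.append((length, length + len(inner)))
        pieces.append(inner)
        length += len(inner)
        cursor = match.end()
    pieces.append(marked[cursor:])
    return "".join(pieces), spans
```

Offsets are reported against the marker-free text, so they can be compared directly with detector predictions made on the answer as a user would see it.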

Original abstract

Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool output, markdown documents, tables, and repository metadata. We introduce a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. Our fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, where it substantially outperforms LettuceDetect-large (0.17) and the strongest zero-shot LLM judges we evaluated (at most 0.22). The same model remains competitive on established natural-language benchmarks, with 81.8 RAGTruth example-F1 and 0.724 English PsiloQA IoU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and natural-language RAG datasets. The benchmark is constructed by starting from grounded correct answers and injecting localized hallucinations with exact character-span labels; the code split receives an evidence-based review. A fine-tuned Qwen3.5-2B detector achieves 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, substantially outperforming LettuceDetect-large (0.17) and the strongest zero-shot LLM judges (≤0.22). The same model remains competitive on established NL benchmarks (81.8 RAGTruth example-F1, 0.724 English PsiloQA IoU).

Significance. If the benchmark construction yields representative test cases, the work supplies a practical detector and evaluation resource for grounded generation systems that consume non-textual inputs. The reported gains on the code-agent source and the use of exact span labels for fine-grained evaluation are concrete strengths. The competitive retention of performance on existing NL benchmarks further supports the unified approach.

major comments (1)
  1. [Abstract] Abstract / benchmark construction: the headline claim of superiority on the code-agent source (0.60 span-F1) rests on test cases generated by localized character-level injection. Real code-agent hallucinations frequently involve non-local phenomena (incorrect API semantics, control-flow errors, type mismatches across calls). The evidence-based review validates the injected labels but does not establish that the synthetic distribution matches the error distribution of actual grounded generators; this directly affects the interpretation of outperformance over LettuceDetect and zero-shot judges.
minor comments (1)
  1. [Abstract] The abstract refers to 'existing natural-language RAG datasets' without naming the specific corpora or describing integration details; this should be clarified in the methods section for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our benchmark construction. We address the major comment point-by-point below and outline planned revisions.

  1. Referee: [Abstract] Abstract / benchmark construction: the headline claim of superiority on the code-agent source (0.60 span-F1) rests on test cases generated by localized character-level injection. Real code-agent hallucinations frequently involve non-local phenomena (incorrect API semantics, control-flow errors, type mismatches across calls). The evidence-based review validates the injected labels but does not establish that the synthetic distribution matches the error distribution of actual grounded generators; this directly affects the interpretation of outperformance over LettuceDetect and zero-shot judges.

    Authors: We agree that the benchmark is constructed via localized character-level injection into grounded answers (Section 3), which enables precise span annotations unavailable in most natural hallucination corpora. The evidence-based review on the code split confirms label validity but, as the referee notes, does not prove distributional equivalence to real generator errors. We do not claim such equivalence; the benchmark is explicitly positioned as a controlled testbed for span-level detection of localized hallucinations across modalities. The reported gains (including 0.60 span-F1 on the code-agent source) therefore demonstrate relative effectiveness on this synthetic distribution rather than a universal claim about real-world error coverage. The competitive retention of performance on RAGTruth and PsiloQA provides supporting evidence of broader utility. In revision we will (1) add an explicit limitations subsection clarifying the synthetic/localized scope and (2) temper abstract language to avoid implying direct equivalence to naturalistic distributions.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper constructs an explicit synthetic benchmark by injecting localized hallucinations into grounded answers and reports measured span-F1 on the resulting test splits. No equations, parameter fits, or derivations are present that reduce the headline performance numbers to tautological definitions or self-citation chains. Benchmark construction and evaluation follow standard empirical practice with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that synthetic localized hallucinations created by injection are sufficiently representative of real hallucinations; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Localized hallucinations can be injected into grounded correct answers with exact character labels to create realistic test cases for span-level detection.
    This is the core construction method stated in the abstract for building the benchmark.

pith-pipeline@v0.9.1-grok · 5742 in / 1153 out tokens · 20826 ms · 2026-07-02T13:09:35.159361+00:00 · methodology

