Beyond Document Grounding: Span-Level Hallucination Detection over Code, Tool Output, and Documents

Ádám Kovács; Bowei He; Gábor Recski; István Boros; Szilveszter Tóth; Xue Liu

arxiv: 2607.00895 · v1 · pith:T7PUTPAZnew · submitted 2026-07-01 · 💻 cs.CL

Pith reviewed 2026-07-02 13:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords hallucination detection · span-level detection · retrieval-augmented generation · code generation · tool use · benchmark construction

The pith

A fine-tuned 2B model detects span-level hallucinations over code, tool outputs, and documents at 0.689 span-F1 on the unified test set.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark that extends hallucination detection beyond natural-language documents to include source code, tool outputs, markdown, tables, and repository metadata. It generates test data by taking correct grounded answers and inserting localized hallucinations labeled at the exact character level, then validates the code test split through evidence-based review. A Qwen3.5-2B model fine-tuned on this data reaches 0.689 span-F1 on the full test set and 0.60 on the code-agent portion. The same model stays competitive on established benchmarks, with 81.8 example-F1 on RAGTruth and 0.724 IoU on English PsiloQA.

Core claim

The authors introduce a unified benchmark for span-level hallucination detection across code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is constructed by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split. Their fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, substantially outperforming LettuceDetect-large (0.17) and the strongest zero-shot LLM judges (at most 0.22), while remaining competitive on established natural-language benchmarks.

What carries the argument

The unified benchmark built by injecting localized hallucinations into grounded correct answers with exact character labels.
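The construction step can be made concrete with a minimal sketch. It assumes each injected hallucination is a replacement of one substring of the clean answer, applied to its first occurrence; the `Edit` type and the `inject` function are hypothetical names for illustration, not the authors' pipeline. The point it shows is that applying edits programmatically yields exact character offsets for every injected span as a by-product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical structured edit: swap `original` for `replacement`."""
    original: str
    replacement: str


def inject(answer: str, edits: list[Edit]) -> tuple[str, list[tuple[int, int]]]:
    """Apply edits to a clean answer.

    Returns the corrupted answer and the [start, end) character spans of the
    injected text, measured in the corrupted answer.
    """
    # Locate every edit in the clean answer (first occurrence only).
    located = []
    for edit in edits:
        pos = answer.find(edit.original)
        if pos == -1:
            raise ValueError(f"edit target not found: {edit.original!r}")
        located.append((pos, edit))
    located.sort(key=lambda item: item[0])

    pieces: list[str] = []
    spans: list[tuple[int, int]] = []
    cursor = 0   # position in the clean answer
    length = 0   # length of the corrupted answer built so far
    for pos, edit in located:
        if pos < cursor:
            raise ValueError("overlapping edits are not supported in this sketch")
        unchanged = answer[cursor:pos]
        pieces.append(unchanged)
        length += len(unchanged)
        pieces.append(edit.replacement)
        spans.append((length, length + len(edit.replacement)))
        length += len(edit.replacement)
        cursor = pos + len(edit.original)
    pieces.append(answer[cursor:])
    return "".join(pieces), spans


clean = "The function returns 3 items and retries 5 times."
corrupted, spans = inject(clean, [Edit("3 items", "7 items"), Edit("5 times", "twice")])
for start, end in spans:
    print(start, end, repr(corrupted[start:end]))
```

Because the offsets are computed while the text is assembled, the labels cannot drift from the text they describe, which is one plausible reason to apply edits in code rather than asking a model to emit the corrupted answer directly.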

If this is right

Span-level detectors can now be trained and evaluated on structured inputs such as code and tool output in addition to documents.
A 2B-parameter model fine-tuned on the benchmark outperforms both prior detectors and zero-shot LLM judges on the new data sources.
The same fine-tuned model maintains competitive performance on existing natural-language hallucination benchmarks.
The injection method with character-level labels enables precise span evaluation across multiple input types.
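To illustrate what character-level labels make possible, here is a minimal sketch of two overlap metrics computed on a single example. It assumes spans are half-open `[start, end)` character ranges and scores by character overlap; the paper's exact definitions of span-F1 and IoU, its aggregation across examples, and its handling of empty predictions are not stated in this summary, so the conventions below are assumptions.

```python
Span = tuple[int, int]


def span_chars(spans: list[Span]) -> set[int]:
    """Expand [start, end) spans into the set of character indices they cover."""
    covered: set[int] = set()
    for start, end in spans:
        covered.update(range(start, end))
    return covered


def char_prf(gold: list[Span], pred: list[Span]) -> tuple[float, float, float]:
    """Character-overlap precision, recall, and F1 for one example.

    Assumption: an empty gold or predicted set scores 0.0 for the affected
    quantity; other conventions exist.
    """
    g, p = span_chars(gold), span_chars(pred)
    overlap = len(g & p)
    precision = overlap / len(p) if p else 0.0
    recall = overlap / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def char_iou(gold: list[Span], pred: list[Span]) -> float:
    """Character-level intersection over union for one example.

    Assumption: two empty span sets count as perfect agreement (1.0).
    """
    g, p = span_chars(gold), span_chars(pred)
    union = g | p
    return len(g & p) / len(union) if union else 1.0


# A prediction shifted by five characters against a ten-character gold span.
print(char_prf([(10, 20)], [(15, 25)]))
print(char_iou([(10, 20)], [(15, 25)]))
```

The shifted prediction in the usage lines covers half of the gold span, which shows how a detector that finds the right region but misplaces the boundaries is partially credited rather than scored as a miss.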

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers building code agents could integrate such a detector to flag unsupported spans before execution.
The benchmark construction approach might be adapted to other structured domains such as database queries or API responses.
Fine-tuning on synthetic localized errors may reduce reliance on large zero-shot judges for structured hallucination tasks.

Load-bearing premise

Starting from grounded correct answers and injecting localized hallucinations with exact labels produces test cases representative of real hallucinations in grounded generation systems that use code and tool outputs.

What would settle it

Direct comparison of the injected hallucinations against actual errors made by deployed code agents on real tasks would show whether the benchmark distributions match real usage.

Figures

Figures reproduced from arXiv: 2607.00895 by Ádám Kovács, Bowei He, Gábor Recski, István Boros, Szilveszter Tóth, Xue Liu.

**Figure 2.** Example benchmark instance. We use [H] ... [/H] markers for gold unsupported spans. ACL injections use paper-specific numerical, entity, relational, methodological, and citation-like edits detectable from the retrieved excerpts. README and Wikipedia injections use a generic markdown prompt covering numerical, temporal, entity, relational, fabricated-reference, and unsupported-claim edits.
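The [H] ... [/H] marker convention from the Figure 2 caption can be converted into character offsets with a few lines of code. The sketch below assumes markers are well-formed and never nested; `strip_markers` is a hypothetical helper, not part of any released tooling.

```python
import re

# Non-greedy match so adjacent marked spans are captured separately.
MARKER = re.compile(r"\[H\](.*?)\[/H\]", re.DOTALL)


def strip_markers(marked: str) -> tuple[str, list[tuple[int, int]]]:
    """Remove [H]...[/H] markers.

    Returns the plain text and the [start, end) character spans of the
    marked regions, measured in the plain text.
    """
    pieces: list[str] = []
    spans: list[tuple[int, int]] = []
    cursor = 0   # position in the marked string
    length = 0   # length of the plain text built so far
    for match in MARKER.finditer(marked):
        before = marked[cursor:match.start()]
        pieces.append(before)
        length += len(before)
        inner = match.group(1)
        pieces.append(inner)
        spans.append((length, length + len(inner)))
        length += len(inner)
        cursor = match.end()
    pieces.append(marked[cursor:])
    return "".join(pieces), spans


marked = "The default timeout is [H]45 seconds[/H] and retries are [H]disabled[/H]."
plain, spans = strip_markers(marked)
print(plain)
print([plain[start:end] for start, end in spans])
```

Inline markers are convenient for prompts and for reading examples, while offsets are what an overlap-based metric needs, so a round trip like this is a natural bridge between the two representations.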

**Figure 3.** Reference grounding for code examples. Cor…

**Figure 4.** Detector prompt used for generative detector training and zero-shot LLM-judge evaluation. For code-agent…

**Figure 5.** Question and clean-answer generation prompts. README and Wikipedia examples first generate a…

**Figure 6.** Generic injection prompt. The model proposes structured replacement edits; the pipeline applies them…

Original abstract

Hallucination detection for retrieval-augmented generation (RAG) is usually evaluated on natural-language document evidence. However, grounded generation systems increasingly rely on structured inputs: source code, developer-tool output, markdown documents, tables, and repository metadata. We introduce a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and existing natural-language RAG datasets. The benchmark is built by starting from grounded correct answers, injecting localized hallucinations with exact character labels, and validating the code test split with evidence-based review. Our fine-tuned Qwen3.5-2B detector reaches 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, where it substantially outperforms LettuceDetect-large (0.17) and the strongest zero-shot LLM judges we evaluated (at most 0.22). The same model remains competitive on established natural-language benchmarks, with 81.8 RAGTruth example-F1 and 0.724 English PsiloQA IoU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a unified benchmark for span-level hallucination detection on code and tool outputs via synthetic injection, with a small fine-tuned model showing gains over baselines, though the data construction leaves open questions about realism.

The paper's main contribution is a benchmark that extends hallucination detection to code, tool output, and structured documents by taking grounded answers and injecting localized hallucinations with exact character-span labels. The authors validate the code split with evidence-based review and report that a fine-tuned Qwen3.5-2B reaches 0.689 span-F1 on the unified set and 0.60 on the code-agent source, beating LettuceDetect-large and zero-shot LLM judges while staying competitive on existing natural-language sets.

This setup is straightforward and gives concrete baseline numbers for a practical setting. The fact that the same small model holds up on prior NL benchmarks is a plus, and the unified coverage itself fills a gap for RAG systems that mix code and tools.

The soft spot is the data generation step. Localized injection assumes real hallucinations are mostly small, character-level deviations from otherwise correct spans. In code and tool outputs, errors often involve non-local problems such as wrong API semantics, type mismatches across calls, or context misapplication. The evidence-based review confirms the injected labels are accurate but does not test whether the injected distribution matches what actual grounded generators produce. If the synthetic cases are narrower or easier, the performance edge may not carry over to real inputs.

This is for researchers building or evaluating detectors for code-aware RAG. A reader who needs new test data or baselines in that niche would get direct value from the numbers and the dataset construction details.

I would send it for peer review. The benchmark is new enough to warrant referee input on its construction and representativeness, even if the evaluation section needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper introduces a unified benchmark for span-level hallucination detection over code, tool output, structured documents, and natural-language RAG datasets. The benchmark is constructed by starting from grounded correct answers and injecting localized hallucinations with exact character-span labels; the code split receives an evidence-based review. A fine-tuned Qwen3.5-2B detector achieves 0.689 span-F1 on the unified test set and 0.60 on the code-agent source, substantially outperforming LettuceDetect-large (0.17) and the strongest zero-shot LLM judges (≤0.22). The same model remains competitive on established NL benchmarks (81.8 RAGTruth example-F1, 0.724 English PsiloQA IoU).

Significance. If the benchmark construction yields representative test cases, the work supplies a practical detector and evaluation resource for grounded generation systems that consume non-textual inputs. The reported gains on the code-agent source and the use of exact span labels for fine-grained evaluation are concrete strengths. The competitive retention of performance on existing NL benchmarks further supports the unified approach.

major comments (1)

[Abstract] Abstract / benchmark construction: the headline claim of superiority on the code-agent source (0.60 span-F1) rests on test cases generated by localized character-level injection. Real code-agent hallucinations frequently involve non-local phenomena (incorrect API semantics, control-flow errors, type mismatches across calls). The evidence-based review validates the injected labels but does not establish that the synthetic distribution matches the error distribution of actual grounded generators; this directly affects the interpretation of outperformance over LettuceDetect and zero-shot judges.

minor comments (1)

[Abstract] The abstract refers to 'existing natural-language RAG datasets' without naming the specific corpora or describing integration details; this should be clarified in the methods section for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our benchmark construction. We address the major comment point-by-point below and outline planned revisions.

Referee: [Abstract] Abstract / benchmark construction: the headline claim of superiority on the code-agent source (0.60 span-F1) rests on test cases generated by localized character-level injection. Real code-agent hallucinations frequently involve non-local phenomena (incorrect API semantics, control-flow errors, type mismatches across calls). The evidence-based review validates the injected labels but does not establish that the synthetic distribution matches the error distribution of actual grounded generators; this directly affects the interpretation of outperformance over LettuceDetect and zero-shot judges.

Authors: We agree that the benchmark is constructed via localized character-level injection into grounded answers (Section 3), which enables precise span annotations unavailable in most natural hallucination corpora. The evidence-based review on the code split confirms label validity but, as the referee notes, does not prove distributional equivalence to real generator errors. We do not claim such equivalence; the benchmark is explicitly positioned as a controlled testbed for span-level detection of localized hallucinations across modalities. The reported gains (including 0.60 span-F1 on the code-agent source) therefore demonstrate relative effectiveness on this synthetic distribution rather than a universal claim about real-world error coverage. The competitive retention of performance on RAGTruth and PsiloQA provides supporting evidence of broader utility. In revision we will (1) add an explicit limitations subsection clarifying the synthetic/localized scope and (2) temper abstract language to avoid implying direct equivalence to naturalistic distributions.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity

The paper constructs an explicit synthetic benchmark by injecting localized hallucinations into grounded answers and reports measured span-F1 on the resulting test splits. No equations, parameter fits, or derivations are present that reduce the headline performance numbers to tautological definitions or self-citation chains. Benchmark construction and evaluation follow standard empirical practice with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that synthetic localized hallucinations created by injection are sufficiently representative of real hallucinations; no free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption: Localized hallucinations can be injected into grounded correct answers with exact character labels to create realistic test cases for span-level detection.
This is the core construction method stated in the abstract for building the benchmark.

