
arxiv: 2605.07024 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks

Pith reviewed 2026-05-11 01:19 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: code hallucinations, fill-in-the-middle, LLM benchmark, code generation, multi-lingual evaluation, runtime verification, adversarial benchmarking

The pith

Code LLMs still hallucinate in fill-in-the-middle tasks, with the strongest model reaching only 84.5 percent pass rate on a verified multi-lingual benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Delulu, a benchmark of 1,951 FIM samples spanning 7 languages and 4 hallucination types such as invented APIs or undefined variables. It builds the set through an adversarial process where a frontier model proposes plausible wrong completions, multiple judge models score them, embeddings cluster harder examples, Docker containers confirm that correct code runs while hallucinated versions fail at runtime, and human experts remove any biased cases. Evaluation of 11 models from five families shows the top score is 84.5 percent pass@1, no family exceeds 0.77 edit similarity, and every family produces hallucination-aligned outputs on a non-trivial share of samples. This indicates the failures are inherent to the FIM task rather than limited to particular model sizes or architectures. Readers should care because these errors look reasonable yet create runtime bugs that simple review misses.
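To make "hallucination-aligned" concrete, here is a minimal sketch of one way such a rate could be computed. It assumes a completion counts as hallucination-aligned when it is textually closer to the hallucinated variant than to the golden one. That rule, the `difflib` similarity used as a stand-in, and the toy rows are illustrative assumptions, not the paper's actual definition.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in textual similarity in [0, 1] (assumption: not the paper's metric)."""
    return SequenceMatcher(None, a, b).ratio()


def is_hallucination_aligned(completion: str, golden: str, hallucinated: str) -> bool:
    """Assumed rule: the completion sits closer to the hallucinated variant."""
    return similarity(completion, hallucinated) > similarity(completion, golden)


def hallucination_rate(rows: list[tuple[str, str, str]]) -> float:
    """Share of (completion, golden, hallucinated) rows flagged as aligned."""
    if not rows:
        return 0.0
    flagged = sum(is_hallucination_aligned(*row) for row in rows)
    return flagged / len(rows)


# Toy rows, invented for illustration only.
ROWS = [
    ("math.square_root(16.0)", "math.sqrt(16.0)", "math.square_root(16.0)"),
    ("math.sqrt(16.0)", "math.sqrt(16.0)", "math.square_root(16.0)"),
]
rate = hallucination_rate(ROWS)
```

A real harness would likely need a margin or an execution check on top of this, since two completions can be textually close and still behave differently.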

Core claim

The central claim is that fill-in-the-middle code generation by large language models produces hallucinations such as invented methods, invalid parameters, undefined variables, and non-existent imports at rates that persist across model families and scales, as shown by Delulu where the strongest model reaches only 84.5 percent pass@1, edit similarity stays below 0.77 for all families, and every family outputs hallucination-aligned completions on a meaningful fraction of the verified samples.
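The two headline metrics can be sketched as follows. This assumes one completion per sample, so that pass@1 reduces to the fraction of samples whose completion passes, and it assumes edit similarity is one minus the Levenshtein distance normalized by the longer string. The paper's exact implementation may differ, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest


def pass_at_1(outcomes: list[bool]) -> float:
    """With a single completion per sample, pass@1 is the pass fraction."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

For instance, `pass_at_1([True, False, True, True])` gives 0.75, and an exact-match completion has edit similarity 1.0.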

What carries the argument

The Delulu benchmark, created via adversarial hallucination generation by a frontier LLM, scoring by four diverse judge models, embedding-based clustering for progressive difficulty, self-contained Docker containers for runtime error verification, and final human-expert review to remove bias.
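The runtime-verification stage amounts to a two-sided check: the golden completion must run, and the hallucinated one must fail with the expected error. A minimal sketch follows, assuming Python-only samples and a local subprocess in place of the paper's Docker containers. The helper names and the toy sample are hypothetical.

```python
import subprocess
import sys


def run_snippet(source: str, timeout: float = 10.0) -> tuple[int, str]:
    """Run Python source in a fresh interpreter; return (exit code, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr


def is_verified(prefix: str, golden: str, hallucinated: str,
                suffix: str, expected_error: str) -> bool:
    """Keep a sample only if golden succeeds and hallucinated fails as expected."""
    golden_code, _ = run_snippet(prefix + golden + suffix)
    bad_code, bad_stderr = run_snippet(prefix + hallucinated + suffix)
    return golden_code == 0 and bad_code != 0 and expected_error in bad_stderr


# Toy sample (assumption): math.square_root is an invented API.
PREFIX = "import math\nvalue = "
SUFFIX = "\nprint(value)\n"
sample_is_verified = is_verified(
    PREFIX, "math.sqrt(16.0)", "math.square_root(16.0)", SUFFIX, "AttributeError"
)
```

Requiring the specific error name, not just a non-zero exit, guards against a hallucinated variant that fails for an unrelated reason such as a missing dependency.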

If this is right

  • Model development must target FIM-specific mechanisms to avoid inventing code elements rather than relying on general scaling.
  • The benchmark supplies a consistent, verified test for measuring reduction in code hallucinations across languages and model sizes.
  • Code assistants using FIM should pair completions with additional static or runtime checks before suggesting them to users.
  • The same verification approach of Docker-isolated execution and multi-judge review could expose hidden failure modes in other code-generation settings.
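As one illustration of the static checks suggested above, the sketch below flags imports in a Python completion that do not resolve in the current environment. It covers only the non-existent-import category and inspects top-level module names only. The function name is hypothetical, and a production check would need to resolve names against the target project's own environment.

```python
import ast
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return top-level module names imported by `source` that cannot be found."""
    missing: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]  # relative imports (level > 0) are skipped
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            try:
                found = importlib.util.find_spec(top_level) is not None
            except (ImportError, ValueError):
                found = False
            if not found and top_level not in missing:
                missing.append(top_level)
    return missing
```

Calling `unresolved_imports("import json\nimport totally_fake_pkg_xyz\n")` should report only the made-up package, assuming no module of that name is installed.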

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines that incorporate similar runtime verification during data filtering could lower hallucination rates more effectively than post-hoc detection alone.
  • The multi-lingual design suggests that language-specific fine-tuning or data augmentation may be needed beyond cross-lingual transfer.
  • Releasing the containers and framework enables community extensions such as adding new hallucination categories or testing on proprietary models.

Load-bearing premise

The full pipeline of adversarial generation, multi-judge filtering, embedding clustering, Docker runtime checks, and human review yields samples that accurately capture real-world FIM hallucinations without introducing systematic bias or artificial difficulty.

What would settle it

A new model achieving pass@1 above 95 percent and edit similarity above 0.85 on the full Delulu set, with no signs of overfitting, would undermine the claim that the observed difficulty is task-intrinsic.

Figures

Figures reproduced from arXiv: 2605.07024 by Aashna Garg, Amabel Gale, Mahdi Erfanian, Nelson Daniel Troncoso, Pareesa Ameneh Golnari, Shengyu Fu, Xiaoyu Liu.

Figure 1
A DELULU sample with golden and hallucinated completions, Docker-verified across 7 languages and 4 hallucination types.
Regarding generation, recent models have made substantial progress on established code benchmarks: top performers now exceed 90% pass@1 on HumanEval [Chen et al., 2021] and achieve competitive scores on SAFIM [Gong et al., 2024]. Yet this progress is increasingly difficult to interpret.…
Figure 2
The five-stage DELULU curation pipeline.
2. A five-stage curation pipeline that pairs each completion mined from real GitHub code with a hallucinated counterpart, filters trivially-decidable cases by repeatedly probing frontier judges, and retains only samples that compile as golden and provably fail as hallucinated (§2.2).
3. 1,951 execution-verified samples across 7 languages, each packaged as a self-con…
Figure 3
Qwen2.5-Coder scaling on DELULU: pass@1 by language (left) and hallucination type (right). Import is hardest at every scale; Rust and Python are the most challenging languages.
Figure 4
Cross-family static metrics on DELULU. Dashed line separates the Qwen scaling slate (left) from the cross-family slate (right).
Figure 5
Per-language pass@1 for the cross-family slate.
(2) Hallucination is universal. Every model surveyed produces hallucination-aligned completions on 0.7–2.0% of samples; StarCoder2-15B has the lowest similarity-based HR (0.007) yet still trails Qwen-32B’s Edit Similarity by 16.5 absolute points. (3) Instruction tuning helps the model know when to stop, not what to write. Base FIM models such as StarCoder2…
Figure 6
The expert annotation interface. Each sample displays the prefix, suffix, and the two
Figure 7
CLAUDE SONNET-assisted fix loop: a C++ sample requires three iterations to resolve missing dependencies before golden (pass) and hallucinated (expected fail) verifications both succeed.
Container finalization. Successfully verified samples are packaged as self-contained Docker containers with all code and dependencies baked in, then pushed to Azure Container Registry. Each container supports three invocat…
Figure 8
CodeBLEU (left) and Edit Similarity (right) by language across Qwen2.5-Coder model
Figure 9
pass@1 heatmaps per model: language (rows)
Original abstract

Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.
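For readers unfamiliar with the task format, the sketch below shows how a FIM sample might be represented and turned into a prompt. The record fields are hypothetical, not the released schema. The default sentinel strings follow the Qwen2.5-Coder convention; other families use different tokens, so treat them as parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FimSample:
    """Hypothetical record layout; the released benchmark schema may differ."""
    prefix: str        # code before the hole
    suffix: str        # code after the hole
    golden: str        # correct middle
    hallucinated: str  # plausible but wrong middle


def build_fim_prompt(sample: FimSample,
                     prefix_token: str = "<|fim_prefix|>",
                     suffix_token: str = "<|fim_suffix|>",
                     middle_token: str = "<|fim_middle|>") -> str:
    """Prefix-suffix-middle ordering: the model generates after the middle token."""
    return f"{prefix_token}{sample.prefix}{suffix_token}{sample.suffix}{middle_token}"


# Toy sample, invented for illustration.
SAMPLE = FimSample(
    prefix="import math\nvalue = ",
    suffix="\nprint(value)\n",
    golden="math.sqrt(16.0)",
    hallucinated="math.square_root(16.0)",
)
prompt = build_fim_prompt(SAMPLE)
```

The model sees both sides of the hole but neither candidate middle; the golden and hallucinated strings are used only for scoring.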

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are constructed via an adversarial pipeline (frontier LLM generation of plausible hallucinations, evaluation by four diverse judge models, embedding-based clustering for harder examples, Docker runtime verification that golden completions succeed while hallucinations fail, and final human-expert review). Evaluation of 11 open-weight FIM models from five families (Qwen2.5-Coder scaling slate plus CodeLlama, DeepSeek-Coder-V2, StarCoder2) shows the strongest model at only 84.5% pass@1, no family exceeding 0.77 Edit Similarity, and every family producing hallucination-aligned completions on a non-trivial share of samples, from which the authors conclude that the exposed difficulty is task-intrinsic rather than family-specific. The benchmark, containers, and evaluation framework are released.

Significance. If the curation pipeline produces samples that accurately reflect real-world FIM hallucinations without systematic bias from the generator or judges, Delulu would be a valuable, reproducible resource for the code-generation community. It supplies concrete, verified failure cases (invented APIs, invalid parameters, undefined variables, non-existent imports) that pass superficial checks yet cause runtime errors, together with a multi-lingual, multi-type coverage that current benchmarks largely lack. The released Docker containers and evaluation code further strengthen its utility for future model development and verification research.

major comments (2)
  1. [§3 (Benchmark Construction Pipeline)] The central claim that difficulty is task-intrinsic (abstract and §4) rests on every model family producing hallucination-aligned completions on a non-trivial share of the 1,951 samples. This requires the samples to be free of systematic bias introduced by the adversarial generator, four-judge filtering, embedding clustering, and human review. The manuscript reports no quantitative checks such as inter-annotator agreement for the human step, hallucination-rate comparison against a non-adversarial random FIM sample, or ablation of the clustering component. Without these, the uniform failure pattern could still reflect shared training-data or curation artifacts.
  2. [§4 (Evaluation Results)] The reported aggregate metrics (84.5% pass@1 for the strongest model, Edit Similarity ≤0.77 across families) are presented without per-language or per-hallucination-type breakdowns. Such breakdowns are needed to substantiate the cross-family consistency claim and to rule out the possibility that a small subset of languages or hallucination types drives the observed difficulty.
minor comments (3)
  1. [§3.3] The abstract states that Docker containers verify 'golden completions compile while hallucinated variants produce the expected runtime error.' A short paragraph in §3.3 or an appendix table listing the concrete runtime error categories (e.g., AttributeError, ImportError) per hallucination type would improve reproducibility.
  2. [§4.1] Notation: 'pass@1' and 'Edit Similarity' are used without an explicit definition or reference to the exact implementation (e.g., whether pass@1 uses the standard CodeXGLUE-style execution or a custom harness). Adding a one-sentence definition in §4.1 would remove ambiguity.
  3. [Conclusion] The release URL is given only in the abstract. Including the GitHub link and a brief description of the released artifacts (benchmark JSON, Dockerfiles, evaluation scripts) in the conclusion or a dedicated 'Reproducibility' subsection would aid readers.
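The table requested in the first minor comment might look like the following for Python. The mapping from the four hallucination types to built-in exception classes is standard Python behavior, while the snippets are invented examples and not benchmark samples. Other languages would surface the same mistakes differently, often at compile time.

```python
# One invented snippet per hallucination type (assumption: Python only).
EXAMPLES = {
    "invented method": "'abc'.to_uppercase()",
    "invalid parameter": "sorted([3, 1, 2], descending=True)",
    "undefined variable": "total = count + 1",
    "non-existent import": "import totally_fake_pkg_xyz",
}


def observed_error(snippet: str):
    """Execute the snippet in an empty namespace; return the exception class name."""
    try:
        exec(snippet, {})
    except Exception as exc:
        return type(exc).__name__
    return None


observed = {kind: observed_error(code) for kind, code in EXAMPLES.items()}
for kind, error_name in observed.items():
    print(f"{kind:22s} -> {error_name}")
```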

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, providing clarifications on our methodology and committing to revisions that strengthen the evidence for our claims without overstating the current results.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction Pipeline): The central claim that difficulty is task-intrinsic (abstract and §4) rests on every model family producing hallucination-aligned completions on a non-trivial share of the 1,951 samples. This requires the samples to be free of systematic bias introduced by the adversarial generator, four-judge filtering, embedding clustering, and human review. The manuscript reports no quantitative checks such as inter-annotator agreement for the human step, hallucination-rate comparison against a non-adversarial random FIM sample, or ablation of the clustering component. Without these, the uniform failure pattern could still reflect shared training-data or curation artifacts.

    Authors: We agree that additional quantitative validation would further support the task-intrinsic claim. The human review was conducted by a single expert using predefined criteria to exclude biased or trivial samples, so inter-annotator agreement does not apply; we will state this and the review criteria explicitly in the revision. We will add a direct comparison of model hallucination rates on Delulu against rates on randomly sampled FIM completions from the same source distributions; preliminary checks indicate that rates on the random samples are substantially lower. We will also include an ablation of the embedding-based clustering step, since removing it increases the proportion of easier samples that all models solve. These additions are intended to address potential curation artifacts. revision: partial

  2. Referee: §4 (Evaluation Results): The reported aggregate metrics (84.5% pass@1 for the strongest model, Edit Similarity ≤0.77 across families) are presented without per-language or per-hallucination-type breakdowns. Such breakdowns are needed to substantiate the cross-family consistency claim and to rule out the possibility that a small subset of languages or hallucination types drives the observed difficulty.

    Authors: We agree that aggregate metrics alone leave room for the possibility that difficulty is concentrated in particular subsets. In the revised manuscript we will add per-language and per-hallucination-type breakdowns of both pass@1 and Edit Similarity for all evaluated models. These will be presented in additional tables or figures and will show that underperformance is distributed across languages and types rather than driven by a small subset, thereby strengthening the cross-family consistency argument. revision: yes
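The breakdowns promised in the second response reduce to grouping per-sample outcomes by language or by hallucination type. A minimal sketch, assuming each result is a record with `language`, `type`, and `passed` fields. The field names and the rows are made up for illustration and are not results from the paper.

```python
from collections import defaultdict

# Invented per-sample outcomes; not data from the paper.
RESULTS = [
    {"language": "python", "type": "import", "passed": False},
    {"language": "python", "type": "method", "passed": True},
    {"language": "rust", "type": "import", "passed": False},
    {"language": "rust", "type": "variable", "passed": True},
    {"language": "go", "type": "parameter", "passed": True},
    {"language": "go", "type": "import", "passed": True},
]


def pass_rate_by(records: list[dict], key: str) -> dict[str, float]:
    """Group records by `key` and return the pass fraction per group."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        passes[record[key]] += int(record["passed"])
    return {group: passes[group] / totals[group] for group in totals}


by_language = pass_rate_by(RESULTS, "language")
by_type = pass_rate_by(RESULTS, "type")
```

Reporting both views side by side is what would show whether a single language or a single hallucination type drives the aggregate.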

Circularity Check

0 steps flagged

No circularity: benchmark constructed from independent external steps; model scores measured against it

Full rationale

The paper describes an empirical benchmark creation pipeline (adversarial LLM generation, multi-judge filtering, embedding clustering, Docker runtime verification, human review) followed by evaluation of 11 external models on the resulting 1,951 samples. No equations, fitted parameters, or predictions are presented. No self-citations are used to justify uniqueness or load-bearing claims. The central observation that difficulty appears across families is a direct measurement on the curated data rather than a reduction to any input by construction. This matches the default expectation of no significant circularity for a benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard domain assumptions about LLM behavior and verification feasibility but introduces no free parameters, new entities, or ad-hoc axioms beyond those typical for benchmark construction.

axioms (2)
  • Domain assumption: Frontier LLMs can generate plausible but incorrect code completions that pass superficial checks.
    Invoked to justify the adversarial generation step in the pipeline.
  • Domain assumption: Docker containers can reliably distinguish compilable golden code from runtime-failing hallucinated variants.
    Central to the verification stage described.

pith-pipeline@v0.9.0 · 5590 in / 1288 out tokens · 39837 ms · 2026-05-11T01:19:34.560514+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. Golnari, Pareesa Ameneh; Kumarappan, Adarsh; Wen, Wen; Liu, Xiaoyu; Ryan, Gabriel; Sun, Yuting; Fu, Shengyu; Nallipogu, Elsie. arXiv.org.
  2. Agarwal, Vibhor; Pei, Yulong; Alamir, Salwa; Liu, Xiaomo. arXiv.org.
  3. Dutta, Sujan; Mahinder, Sayantan; Anantha, Raviteja; Bandyopadhyay, Bortik. Applying RLAIF for Code Generation with API-usage in Lightweight LLMs. doi:10.48550/arXiv.2406.20060.
  4. Austin, Jacob; Odena, Augustus; Nye, Maxwell; Bosma, Maarten; Michalewski, H.; Dohan, David; Jiang, Ellen; Cai, Carrie J.; Terry, Michael; Le, Quoc V.; et al. Program Synthesis with Large Language Models.
  5. Bavarian, Mohammad; Jun, Heewoo; Tezak, Nikolas; Schulman, John; McLeavey, Christine; Tworek, Jerry; Chen, Mark. arXiv.org.
  6. Cassano, Federico; Gouwar, John; Nguyen, Daniel; Nguyen, S.; Phipps-Costin, Luna; Pinckney, Donald; Yee, Ming-Ho; Zi, Yangtian; Anderson, Carolyn Jane; Feldman, Molly Q.; et al. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. IEEE Transactions on Software Engineering.
  7. Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pondé, Henrique; Kaplan, Jared; Edwards, Harrison; Burda, Yura; Joseph, Nicholas; Brockman, Greg; et al. Evaluating Large Language Models Trained on Code.
  8. Swe-bench pro: Can ai agents solve long-horizon software engineering tasks? arXiv.org.
  9. Ding, Yangruibo; Wang, Zijian; Ahmad, Wasi Uddin; Ding, Hantian; Tan, Ming; Jain, Nihal; Ramanathan, M. K.; Nallapati, Ramesh; Bhatia, Parminder; Roth, Dan; et al. CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion. Neural Information Processing Systems.
  10. Du, Xueying; Liu, Mingwei; Wang, Kaixin; Wang, Hanlin; Liu, Junwei; Chen, Yixuan; Feng, Jiayi; Sha, Chaofeng; Peng, Xin; Lou, Yiling. arXiv.org.
  11. Gong, Linyuan; Wang, Sida; Elhoushi, Mostafa; Cheung, Alvin. Evaluation of
  12. Hendrycks, Dan; Basart, Steven; Kadavath, Saurav; Mazeika, Mantas; Arora, Akul; Guo, Ethan; Burns, Collin; Puranik, Samir; He, Horace; Song, D.; et al. Measuring Coding Challenge Competence With APPS.
  13. Hui, Binyuan; Yang, Jian; Cui, Zeyu; Yang, Jiaxi; Liu, Dayiheng; Zhang, Lei; Liu, Tianyu; Zhang, Jiajun; Yu, Bowen; Dang, K.; et al. Qwen2.5-Coder Technical Report.
  14. Iyer, Srinivasan; Konstas, Ioannis; Cheung, Alvin; Zettlemoyer, Luke. Mapping Language to Code in Programmatic Context. Conference on Empirical Methods in Natural Language Processing.
  15. Jain, Naman; Han, King; Gu, Alex; Li, Wen-Ding; Yan, Fanjia; Zhang, Tianjun; Wang, Sida; Solar-Lezama, Armando; Sen, Koushik; Stoica, Ion. International Conference on Learning Representations.
  16. Jimenez, Carlos E.; Yang, John; Wettig, Alexander; Yao, Shunyu; Pei, Kexin; Press, Ofir; Narasimhan, Karthik. International Conference on Learning Representations.
  17. Landman, D.; Serebrenik, Alexander; Vinju, J. Empirical analysis of the relationship between CC and SLOC in a large corpus of Java methods and C functions. J. Softw. Evol. Process.
  18. Lee, Yunseo; Song, John Youngeun; Kim, Dongsun; Kim, Jindae; Kim, Mijung; Nam, Jaechang. arXiv.org.
  19. Gao, Cuiyun; Fan, Guodong; Chong, Chun Yong; Chen, Shizhan; Liu, Chao; Lo, David; Zheng, Zibin; Liao, Qing. arXiv.org.
  20. Rozière, Baptiste; Gehring, Jonas; Gloeckle, Fabian; Sootla, Sten; Gat, Itai; Tan, Xiaoqing; Adi, Yossi; Liu, Jingyu; Remez, Tal; Rapin, J.; et al. Code. 2023.
  21. DeepSeek-AI; Zhu, Qihao; Guo, Daya; Shao, Zhihong; Yang, Dejian; Wang, Peiyi; Xu, Runxin; Wu, Y.; Li, Yukun; Gao, Huazuo; et al. 2024.
  22. Lozhkov, Anton; Li, Raymond; Allal, Loubna Ben; Cassano, Federico; Lamy-Poirier, J.; Tazi, Nouamane; Tang, Ao; Pykhtar, Dmytro; Liu, Jiawei; Wei, Yuxiang; et al. 2024.
  23. Paul, Debalina Ghosh; Zhu, Hong; Bayley, Ian. Benchmarks, Metrics, and Evaluations of Code Generation: A Critical Review. International Conference on Artificial Intelligence Testing.
  24. Ren, Shuo; Guo, Daya; Lu, Shuai; Zhou, Long; Liu, Shujie; Tang, Duyu; Zhou, M.; Blanco, Ambrosio; Ma, Shuai. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis.
  25. Wu, Qinyun; Peng, Chao; Gao, Pengfei; Hu, Ruida; Gan, Haoyu; Jiang, Bo; Tang, Jin; Deng, Zhiwen; Guan, Zhanming; Gao, Cuiyun; et al. International Conference on Automated Software Engineering.
  26. Yin, Pengcheng; Deng, Bowen; Chen, Edgar; Vasilescu, Bogdan; Neubig, Graham. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. IEEE Working Conference on Mining Software Repositories.
  27. Yu, Hao; Shen, Bo; Ran, Dezhi; Zhang, Jiaxin; Zhang, Qi; Ma, Yuchi; Liang, Guangtai; Li, Ying; Wang, Qianxiang; Xie, Tao. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering.
  28. Zhuo, Terry Yue; Vu, Minh Chien; Chim, Jenny; Hu, Han; Yu, Wenhao; Widyasari, Ratnadira; Yusuf, Imam Nur Bani; Zhan, Haolan; He, Junda; Paul, Indraneil; et al. International Conference on Learning Representations.
  29. EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories. arXiv.org.
  30. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 2022.