Delulu: A Verified Multi-Lingual Benchmark for Code Hallucination Detection in Fill-in-the-Middle Tasks
Pith reviewed 2026-05-11 01:19 UTC · model grok-4.3
The pith
Code LLMs still hallucinate in fill-in-the-middle tasks: the strongest model reaches only 84.5 percent pass@1 on a verified multi-lingual benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that fill-in-the-middle code generation by large language models produces hallucinations (invented methods, invalid parameters, undefined variables, and non-existent imports) at rates that persist across model families and scales. On Delulu, the strongest model reaches only 84.5 percent pass@1, Edit Similarity stays at or below 0.77 for every family, and every family outputs hallucination-aligned completions on a non-trivial share of the verified samples.
What carries the argument
The Delulu benchmark, created via adversarial hallucination generation by a frontier LLM, scoring by four diverse judge models, embedding-based clustering for progressive difficulty, self-contained Docker containers for runtime error verification, and final human-expert review to remove bias.
If this is right
- Model development must target FIM-specific mechanisms to avoid inventing code elements rather than relying on general scaling.
- The benchmark supplies a consistent, verified test for measuring reduction in code hallucinations across languages and model sizes.
- Code assistants using FIM should pair completions with additional static or runtime checks before suggesting them to users.
- The same verification approach of Docker-isolated execution and multi-judge review could expose hidden failure modes in other code-generation settings.
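The third implication, pairing completions with static checks, can be illustrated for Python with a minimal sketch. It covers only two of the four hallucination types named in the paper (non-existent imports and invented module-level methods), and the function names `unresolved_imports` and `invented_attributes` are hypothetical helpers written for this illustration, not part of the Delulu framework.

```python
import ast
import importlib
import importlib.util


def unresolved_imports(source: str) -> list[str]:
    """Return imported module names whose top-level package cannot be located."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            # Only the top-level package is looked up, so nothing is executed here.
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return missing


def invented_attributes(source: str) -> list[str]:
    """Flag `module.attr` accesses where an imported module has no such attribute.

    Caveat: this imports the referenced modules, so it should only be used
    on code whose dependencies are trusted (for example, the standard library).
    """
    tree = ast.parse(source)
    modules = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if "." not in alias.name and importlib.util.find_spec(alias.name):
                    modules[alias.asname or alias.name] = importlib.import_module(alias.name)
    flagged = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in modules
            and not hasattr(modules[node.value.id], node.attr)
        ):
            flagged.append(f"{node.value.id}.{node.attr}")
    return flagged
```

A check like this is cheap enough to run before a suggestion is shown, but it is deliberately shallow: it does not follow nested attributes, instance methods, or call signatures, which is where a runtime check would still be needed.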
Where Pith is reading between the lines
- Training pipelines that incorporate similar runtime verification during data filtering could lower hallucination rates more effectively than post-hoc detection alone.
- The multi-lingual design suggests that language-specific fine-tuning or data augmentation may be needed beyond cross-lingual transfer.
- Releasing the containers and framework enables community extensions such as adding new hallucination categories or testing on proprietary models.
Load-bearing premise
The full pipeline of adversarial generation, multi-judge filtering, embedding clustering, Docker runtime checks, and human review yields samples that accurately capture real-world FIM hallucinations without introducing systematic bias or artificial difficulty.
What would settle it
A new model achieving pass@1 above 95 percent and edit similarity above 0.85 on the full Delulu set, with no signs of overfitting, would undermine the claim that the observed difficulty is task-intrinsic.
Original abstract
Large Language Models for code generation frequently produce hallucinations in Fill-in-the-Middle (FIM) tasks -- plausible but incorrect completions such as invented API methods, invalid parameters, undefined variables, or non-existent imports. These failures pass superficial review yet introduce runtime errors. We introduce Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are curated through an adversarial pipeline: a frontier LLM generates plausible hallucinations, four diverse judge models evaluate them, embedding-based clustering mines progressively harder examples, self-contained Docker containers verify that golden completions compile while hallucinated variants produce the expected runtime error, and a final human-expert review removes any remaining biased or trivially decidable samples. We evaluate 11 open-weight FIM models from five families spanning 0.5B-32B parameters: a six-point Qwen2.5-Coder scaling slate, plus a cross-family slate (CodeLlama, DeepSeek-Coder-V2, StarCoder2). The strongest model reaches only 84.5% pass@1, no family exceeds 0.77 Edit Similarity, and every family produces hallucination-aligned completions on a non-trivial share of samples, confirming that the difficulty exposed by Delulu is task-intrinsic rather than family-specific. We release the benchmark, containers, and evaluation framework at https://github.com/microsoft/delulu.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Delulu, a verified multi-lingual benchmark of 1,951 FIM samples across 7 languages and 4 hallucination types. Samples are constructed via an adversarial pipeline (frontier LLM generation of plausible hallucinations, evaluation by four diverse judge models, embedding-based clustering for harder examples, Docker runtime verification that golden completions succeed while hallucinations fail, and final human-expert review). Evaluation of 11 open-weight FIM models from five families (Qwen2.5-Coder scaling slate plus CodeLlama, DeepSeek-Coder-V2, StarCoder2) shows the strongest model at only 84.5% pass@1, no family exceeding 0.77 Edit Similarity, and every family producing hallucination-aligned completions on a non-trivial share of samples, from which the authors conclude that the exposed difficulty is task-intrinsic rather than family-specific. The benchmark, containers, and evaluation framework are released.
Significance. If the curation pipeline produces samples that accurately reflect real-world FIM hallucinations without systematic bias from the generator or judges, Delulu would be a valuable, reproducible resource for the code-generation community. It supplies concrete, verified failure cases (invented APIs, invalid parameters, undefined variables, non-existent imports) that pass superficial checks yet cause runtime errors, together with multi-lingual, multi-type coverage that current benchmarks largely lack. The released Docker containers and evaluation code further strengthen its utility for future model development and verification research.
major comments (2)
- [§3 (Benchmark Construction Pipeline)] The central claim that difficulty is task-intrinsic (abstract and §4) rests on every model family producing hallucination-aligned completions on a non-trivial share of the 1,951 samples. This requires the samples to be free of systematic bias introduced by the adversarial generator, four-judge filtering, embedding clustering, and human review. The manuscript reports no quantitative checks such as inter-annotator agreement for the human step, hallucination-rate comparison against a non-adversarial random FIM sample, or ablation of the clustering component. Without these, the uniform failure pattern could still reflect shared training-data or curation artifacts.
- [§4 (Evaluation Results)] The reported aggregate metrics (84.5% pass@1 for the strongest model, Edit Similarity ≤0.77 across families) are presented without per-language or per-hallucination-type breakdowns. Such breakdowns are needed to substantiate the cross-family consistency claim and to rule out the possibility that a small subset of languages or hallucination types drives the observed difficulty.
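For readers unfamiliar with the inter-annotator agreement check requested in the first major comment, the following is a minimal sketch of Cohen's kappa for two reviewers. It assumes a hypothetical setup in which a second expert independently labels each sample as "keep" or "drop"; the paper itself reports no such second annotation, so the labels in the usage note are illustrative only.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sample set")
    n = len(labels_a)
    # Fraction of samples on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

With the toy labels `["keep", "keep", "drop", "drop"]` and `["keep", "drop", "drop", "drop"]`, observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5.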
minor comments (3)
- [§3.3] The abstract states that Docker containers verify 'golden completions compile while hallucinated variants produce the expected runtime error.' A short paragraph in §3.3 or an appendix table listing the concrete runtime error categories (e.g., AttributeError, ImportError) per hallucination type would improve reproducibility.
- [§4.1] Notation: 'pass@1' and 'Edit Similarity' are used without an explicit definition or reference to the exact implementation (e.g., whether pass@1 uses the standard HumanEval-style execution harness or a custom one). Adding a one-sentence definition in §4.1 would remove ambiguity.
- [Conclusion] The release URL is given only in the abstract. Including the GitHub link and a brief description of the released artifacts (benchmark JSON, Dockerfiles, evaluation scripts) in the conclusion or a dedicated 'Reproducibility' subsection would aid readers.
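The second minor comment asks for explicit metric definitions. As a point of reference, the sketch below gives commonly used forms of both metrics: the unbiased pass@k estimator popularized by the HumanEval evaluation, and a Levenshtein-based edit similarity normalized by the longer string. Whether Delulu uses exactly these forms is an assumption; the paper's own harness should be treated as authoritative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n generated samples, of which c pass."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(prediction: str, reference: str) -> float:
    """1 minus normalized edit distance; 1.0 means an exact match."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(prediction, reference) / longest
```

For k = 1 the estimator reduces to c / n, so with a single greedy sample per task pass@1 is simply the fraction of tasks whose completion passes.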
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, providing clarifications on our methodology and committing to revisions that strengthen the evidence for our claims without overstating the current results.
- Referee: §3 (Benchmark Construction Pipeline): The central claim that difficulty is task-intrinsic (abstract and §4) rests on every model family producing hallucination-aligned completions on a non-trivial share of the 1,951 samples. This requires the samples to be free of systematic bias introduced by the adversarial generator, four-judge filtering, embedding clustering, and human review. The manuscript reports no quantitative checks such as inter-annotator agreement for the human step, hallucination-rate comparison against a non-adversarial random FIM sample, or ablation of the clustering component. Without these, the uniform failure pattern could still reflect shared training-data or curation artifacts.
  Authors: We agree that additional quantitative validation would further support the task-intrinsic claim. The human review was conducted by a single expert using predefined criteria to exclude biased or trivial samples, rendering inter-annotator agreement inapplicable; we will explicitly state this and the review criteria in the revision. We will add a direct comparison of model hallucination rates on Delulu versus randomly sampled FIM completions from the same source distributions; preliminary checks indicate that rates on the random samples are substantially lower. We will also include an ablation of the embedding-based clustering step; removing it increases the proportion of easier samples solved by all models. These elements will be incorporated to address potential curation artifacts.
  revision: partial
- Referee: §4 (Evaluation Results): The reported aggregate metrics (84.5% pass@1 for the strongest model, Edit Similarity ≤0.77 across families) are presented without per-language or per-hallucination-type breakdowns. Such breakdowns are needed to substantiate the cross-family consistency claim and to rule out the possibility that a small subset of languages or hallucination types drives the observed difficulty.
  Authors: We agree that aggregate metrics alone leave room for the possibility that difficulty is concentrated in particular subsets. In the revised manuscript we will add per-language and per-hallucination-type breakdowns of both pass@1 and Edit Similarity for all evaluated models. These will be presented in additional tables or figures and will show that underperformance is distributed across languages and types rather than driven by a small subset, thereby strengthening the cross-family consistency argument.
  revision: yes
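The breakdowns promised in this response amount to grouping per-sample outcomes by a metadata field. A minimal sketch follows, assuming hypothetical result records with `language`, `hallucination_type`, and `passed` fields; the field names are illustrative and the released evaluation framework may use a different schema.

```python
from collections import defaultdict


def breakdown(records: list[dict], key: str) -> dict[str, float]:
    """Pass rate per group, where `key` names the grouping field."""
    totals = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for record in records:
        bucket = totals[record[key]]
        bucket[0] += int(record["passed"])
        bucket[1] += 1
    return {group: passed / total for group, (passed, total) in totals.items()}


# Toy records for illustration only; these are not results from the paper.
toy_records = [
    {"language": "python", "hallucination_type": "invented_method", "passed": True},
    {"language": "python", "hallucination_type": "missing_import", "passed": False},
    {"language": "java", "hallucination_type": "invented_method", "passed": True},
]
by_language = breakdown(toy_records, "language")
by_type = breakdown(toy_records, "hallucination_type")
```

Reporting the per-group sample counts alongside each rate would also help readers judge whether a low score in one language reflects real difficulty or just a small subset.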
Circularity Check
No circularity: benchmark constructed from independent external steps; model scores measured against it
The paper describes an empirical benchmark creation pipeline (adversarial LLM generation, multi-judge filtering, embedding clustering, Docker runtime verification, human review) followed by evaluation of 11 external models on the resulting 1,951 samples. No equations, fitted parameters, or predictions are presented. No self-citations are used to justify uniqueness or load-bearing claims. The central observation that difficulty appears across families is a direct measurement on the curated data rather than a reduction to any input by construction. This matches the default expectation of no significant circularity for a benchmark paper.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Frontier LLMs can generate plausible but incorrect code completions that pass superficial checks
- domain assumption: Docker containers can reliably distinguish compilable golden code from runtime-failing hallucinated variants
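The second axiom can be made concrete with a small sketch of the verification idea: the golden completion must run cleanly, while the hallucinated variant must fail with the expected error. The paper performs this inside self-contained Docker containers across seven languages; the version below substitutes a local Python subprocess as a stand-in, and `run_snippet` and `verify_pair` are hypothetical names.

```python
import subprocess
import sys


def run_snippet(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a fresh interpreter (a stand-in for a container)."""
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def verify_pair(golden: str, hallucinated: str, expected_error: str) -> bool:
    """True if golden exits cleanly and hallucinated fails with expected_error."""
    good = run_snippet(golden)
    bad = run_snippet(hallucinated)
    return (
        good.returncode == 0
        and bad.returncode != 0
        and expected_error in bad.stderr  # e.g. "AttributeError" in the traceback
    )
```

For instance, a golden snippet calling `json.loads('{}')` paired with a hallucinated snippet calling the non-existent `json.parse('{}')` should verify with `expected_error="AttributeError"`. A subprocess offers no isolation from the host, which is one reason a container is the safer choice for untrusted completions.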