
arxiv: 2509.22202 · v3 · pith:JL6HB3QNnew · submitted 2025-09-26 · 💻 cs.SE · cs.CL

Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries

Pith reviewed 2026-05-21 22:27 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords library hallucinations · LLM-generated code · prompt variations · code generation risks · software supply chain · AI hallucinations · benchmark creation

The pith

Small prompt changes like one-character misspellings cause LLMs to invent non-existent libraries in up to 26% of code tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how realistic variations in user prompts affect the tendency of large language models to generate code that references non-existent libraries. It shows that even minor errors, such as single-character misspellings, lead to invalid imports in up to 26% of tasks, while completely made-up library names are accepted in up to 99% of cases. These effects are measured across seven LLMs and several prompt types, including prompts with time references. The authors build a benchmark, LibHalluBench, to test and measure these hallucinations systematically. The patterns matter for developers who rely on AI for coding because they point to concrete failure modes: broken builds and supply chain exposure.

Core claim

The central discovery is that library hallucinations in LLM-generated code are highly sensitive to user prompt variations. Specifically, one-character misspellings trigger hallucinations in up to 26% of tasks, fabricated library names are accepted in up to 99% of cases, and time-based prompts induce hallucinations in up to 85%. The study analyzes both library name hallucinations involving invalid imports and library member hallucinations involving invalid calls from valid libraries across seven diverse LLMs. These findings are used to ground the introduction of LibHalluBench, a benchmark for reproducible evaluation of such hallucinations.
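To make the misspelling condition concrete, the following Python sketch enumerates one-character variants of a library name. This summary does not describe the paper's actual perturbation procedure, so the helper name `one_char_misspellings`, the alphabet, and the choice of edit operations (deletion, substitution, insertion) are assumptions made purely for illustration.

```python
import string

# Hypothetical helper: not taken from the paper or from LibHalluBench.
# Assumed alphabet for substituted and inserted characters.
ALPHABET = string.ascii_lowercase + string.digits


def one_char_misspellings(name: str) -> set:
    """Return the non-empty strings that are one edit away from `name`."""
    variants = set()
    for i in range(len(name)):
        # Deletion: drop the character at position i.
        variants.add(name[:i] + name[i + 1:])
        # Substitution: replace the character at position i.
        for ch in ALPHABET:
            variants.add(name[:i] + ch + name[i + 1:])
    for i in range(len(name) + 1):
        # Insertion: add one character before position i.
        for ch in ALPHABET:
            variants.add(name[:i] + ch + name[i:])
    variants.discard(name)  # a substitution can reproduce the original
    variants.discard("")    # a one-letter name would delete to nothing
    return variants


if __name__ == "__main__":
    variants = one_char_misspellings("numpy")
    print(sorted(variants)[:5])
```

Each variant could then be placed into a prompt template in place of the correct name; which variants a study actually samples is a design choice this sketch does not settle.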

What carries the argument

Controlled variations in developer prompts, including misspellings and fabricated library or member names, used to measure the rates of invalid imports and invalid function calls in generated code.
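One way an invalid-import rate could be measured is sketched below: parse the generated code with Python's standard `ast` module, collect the top-level module names it imports, and flag any that are not in a known set. The function names and the `known` allowlist are hypothetical, and this is not presented as the authors' detection pipeline.

```python
import ast


def imported_top_level_modules(source: str) -> set:
    """Collect top-level module names from absolute imports in `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b as c` contributes the top-level name `a`.
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (`from . import x`), which have level > 0.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def unknown_imports(source: str, known: set) -> set:
    """Return imported module names that are missing from `known`.

    `known` is an assumed allowlist supplied by the caller, for example
    names gathered from a package index snapshot plus the standard library.
    """
    return imported_top_level_modules(source) - set(known)


if __name__ == "__main__":
    generated = "import numpy as np\nfrom pandas.io import json\nimport numpi\n"
    print(unknown_imports(generated, {"numpy", "pandas"}))
```

An import name is not always the same as the distribution name on a package index (for example, `sklearn` is imported from the `scikit-learn` distribution), so a realistic check would also need a mapping between the two.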

Load-bearing premise

The specific prompt variations and the seven LLMs tested are representative of how developers actually query code generation tools and make mistakes.

What would settle it

Running the same set of prompt variations on additional LLMs not included in the study or on real-world logs of developer interactions with code assistants, and observing whether the hallucination rates remain consistent.

Original abstract

Large language models (LLMs) now play a central role in code generation, yet they continue to hallucinate, frequently inventing non-existent libraries. Such library hallucinations are not just benign errors: they can mislead developers, break builds, and expose systems to supply chain threats such as slopsquatting. Despite growing awareness of these risks, there is limited understanding of how library hallucinations manifest under realistic usage conditions. To fill this gap, we present the first systematic study of how user-level prompt variations influence library hallucinations in LLM-generated code. Across seven diverse LLMs, we analyse library name hallucinations (invalid imports) and library member hallucinations (invalid calls from valid libraries), examining the effects of realistic developer language and controlled user mistakes, including misspellings and fabricated libraries or members. Our findings expose systemic vulnerabilities: one-character misspellings trigger hallucinations in up to 26% of tasks; fabricated library names are accepted in up to 99%; and time-based prompts induce hallucinations in up to 85%. Grounded in the highest-risk prompts identified in our study, we introduce LibHalluBench, a benchmark that enables a systematic and reproducible evaluation of these library hallucinations. Our findings underscore the fragility of LLMs to natural prompt variation and highlight the urgent need for safeguards against library-related hallucinations and their downstream risks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the first systematic empirical study of library hallucinations (invalid imports and invalid member calls) in code generated by seven LLMs. It examines the influence of realistic developer prompt variations, including one-character misspellings, fabricated library names, and time-based prompts, reports quantitative hallucination rates (up to 26%, 99%, and 85% respectively), and introduces LibHalluBench as a benchmark derived from the highest-risk prompts identified.

Significance. If the central measurements are shown to be robust, the work is significant because it quantifies how common, low-effort prompt variations can produce high rates of library hallucinations with downstream security implications (e.g., slopsquatting). The introduction of LibHalluBench is a constructive contribution that could support reproducible follow-on evaluation in the LLM-for-code literature.

major comments (1)
  1. [Section 3 and Section 4] Section 3 (Methodology) and Section 4 (Results): the central claims attribute specific hallucination rates to the tested prompt variations (e.g., 'one-character misspellings trigger hallucinations in up to 26% of tasks'). However, the reported experiments do not include or reference paired baseline measurements on matched prompts that use correct spellings and valid library names. Without these controls it is not possible to determine whether the observed rates exceed the models' baseline hallucination propensity on the chosen tasks, weakening the causal language of 'trigger' and 'influence'.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'up to 26%' (and similar maxima) should be accompanied by the specific model, task count, and prompt template that produced the maximum so readers can assess the scope of the reported effect.
  2. [Section 3] The manuscript would benefit from an explicit statement of the total number of prompts, tasks, and generations per condition, together with any statistical tests or confidence intervals used to support the reported percentages.
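As an illustration of the interval reporting requested in the second minor comment, a Wilson score interval for a hallucination rate can be computed as sketched below. The counts in the example are placeholders rather than figures from the paper, and the choice of the Wilson method is an assumption; the manuscript may use a different procedure.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion.

    `z` defaults to 1.96, which corresponds to roughly 95% coverage.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    # Placeholder counts: 26 hallucinated outputs out of 100 generations.
    low, high = wilson_interval(26, 100)
    print(f"rate = 0.26, interval = [{low:.3f}, {high:.3f}]")
```

Reporting the interval next to each "up to" figure would show how much of the maximum could be sampling noise for the given number of generations.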

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the work's significance and for this constructive comment on experimental controls. We address the point below and have revised the manuscript to strengthen the presentation of results.

Point-by-point responses
  1. Referee: [Section 3 and Section 4] Section 3 (Methodology) and Section 4 (Results): the central claims attribute specific hallucination rates to the tested prompt variations (e.g., 'one-character misspellings trigger hallucinations in up to 26% of tasks'). However, the reported experiments do not include or reference paired baseline measurements on matched prompts that use correct spellings and valid library names. Without these controls it is not possible to determine whether the observed rates exceed the models' baseline hallucination propensity on the chosen tasks, weakening the causal language of 'trigger' and 'influence'.

    Authors: We agree that explicit paired baselines on matched prompts with correct spellings and valid library names would allow clearer isolation of the incremental effect of the variations. Our original design emphasized realistic developer prompt conditions rather than exhaustive controls, but we acknowledge this limits strong causal attribution. In the revised manuscript we have added these baseline conditions to Section 3 and report comparative hallucination rates in Section 4, showing that rates under the varied prompts are substantially higher than the matched correct-prompt baselines. We have updated the abstract, results, and discussion to frame the findings in terms of relative increases rather than absolute triggering, while retaining the quantitative rates observed under each condition.

    Revision: yes
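The paired comparison discussed in this exchange can be sketched as follows, assuming each task yields one boolean outcome (hallucinated or not) under the correct-prompt baseline and one under the varied prompt. The function name, the input layout, and the example data are hypothetical and do not reproduce the paper's analysis.

```python
def paired_summary(baseline: list, varied: list) -> dict:
    """Compare hallucination outcomes on matched tasks.

    `baseline[i]` and `varied[i]` are True when task i produced a
    hallucination under the correct prompt and the varied prompt.
    """
    if not baseline or len(baseline) != len(varied):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(baseline)
    baseline_rate = sum(baseline) / n
    varied_rate = sum(varied) / n
    # Discordant pairs show which condition changed the outcome of a task.
    only_varied = sum(v and not b for b, v in zip(baseline, varied))
    only_baseline = sum(b and not v for b, v in zip(baseline, varied))
    return {
        "baseline_rate": baseline_rate,
        "varied_rate": varied_rate,
        "rate_difference": varied_rate - baseline_rate,
        "only_varied": only_varied,
        "only_baseline": only_baseline,
    }


if __name__ == "__main__":
    # Placeholder outcomes for four matched tasks, not data from the paper.
    base = [False, False, True, False]
    var = [True, False, True, True]
    print(paired_summary(base, var))
```

The discordant counts (`only_varied`, `only_baseline`) are what a paired significance test such as McNemar's would operate on, which is one way to support language about relative increases.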

Circularity Check

0 steps flagged

No circularity: empirical measurement study with direct experimental outcomes

Full rationale

This paper conducts an empirical study testing library hallucinations across seven LLMs under controlled prompt variations including misspellings, fabricated names, and time-based prompts. The central claims report observed hallucination rates (e.g., up to 26% for one-character misspellings) as direct results from the experiments rather than any mathematical derivation, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or first-principles chains are present that reduce outputs to inputs by construction; LibHalluBench is introduced as a benchmark grounded in the highest-risk prompts identified experimentally. The analysis is self-contained against external benchmarks with no reduction to prior author work or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study of LLM behavior under varied prompts. No mathematical derivations, fitted parameters, or new postulated entities are introduced beyond the creation of the evaluation benchmark.

pith-pipeline@v0.9.0 · 5773 in / 1154 out tokens · 65270 ms · 2026-05-21T22:27:21.690568+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Hallucination Inspector: A Fact-Checking Judge for API Migration

    cs.SE 2026-04 unverdicted novelty 6.0

    Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives vers...

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    CodeMirage : Hallucinations in Code Generated by Large Language Models , 2024

    Vibhor Agarwal, Yulong Pei, Salwa Alamir, and Xiaomo Liu. CodeMirage : Hallucinations in Code Generated by Large Language Models , 2024. URL https://arxiv.org/abs/2408.08333v1

  3. [3]

    37 Hidden Python Libraries That Are Absolute Gems , 2023

    Avi Chawla. 37 Hidden Python Libraries That Are Absolute Gems , 2023. URL https://blog.dailydoseofds.com/p/gem-libraries

  4. [4]

    A Survey on Evaluating Large Language Models in Code Generation Tasks

    Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, and Shikun Zhang. A Survey on Evaluating Large Language Models in Code Generation Tasks . 2024. doi:10.48550/ARXIV.2408.16498. URL https://arxiv.org/abs/2408.16498

  5. [5]

    Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware , 2025

    Yujia Chen, Mingyu Chen, Cuiyun Gao, Zhihan Jiang, Zhongqi Li, and Yuchi Ma. Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware , 2025. URL http://arxiv.org/abs/2505.05057

  6. [6]

    Dated data: Tracing knowledge cutoffs in large language models.arXiv preprint arXiv:2403.12958, 2024

    Jeffrey Cheng, Marc Marone, Orion Weller, Dawn Lawrie, Daniel Khashabi, and Benjamin Van Durme. Dated Data : Tracing Knowledge Cutoffs in Large Language Models . 2024. doi:10.48550/ARXIV.2403.12958. URL https://arxiv.org/abs/2403.12958

  7. [7]

    Extended Syntax | Markdown Guide , 2025

    Matt Cone. Extended Syntax | Markdown Guide , 2025. URL https://www.markdownguide.org/extended-syntax/

  8. [8]

    Measuring dependency freshness in software systems

    Joël Cox, Eric Bouwers, Marko van Eekelen, and Joost Visser. Measuring dependency freshness in software systems. In Proceedings of the 37th International Conference on Software Engineering - Volume 2 , ICSE '15, pp.\ 109--118. IEEE Press, 2015

  9. [9]

    DeepSeek-V3 .1 Release , 2025

    DeepSeek. DeepSeek-V3 .1 Release , 2025. URL https://api-docs.deepseek.com/news/news250821

  10. [10]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  11. [11]

    NL-Augmenter : A Framework for Task-Sensitive Natural Language Augmentation

    Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahadiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshang Wu, Jascha Sohl-Dickstein, Jinho Choi, Eduard Hovy, Ondřej Dušek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Car...

  12. [12]

    Studying How Configurations Impact Code Generation in LLMs : The Case of ChatGPT

    Benedetta Donato, Leonardo Mariani, Daniela Micucci, and Oliviero Riganelli. Studying How Configurations Impact Code Generation in LLMs : The Case of ChatGPT . In The Proceedings of the 33rd IEEE / ACM International Conference on Program Comprehension . arXiv, February 2025. doi:10.48550/arXiv.2502.17450

  13. [13]

    De- Hallucinator : Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding , 2024

    Aryaz Eghbali and Michael Pradel. De- Hallucinator : Mitigating LLM Hallucinations in Code Generation Tasks via Iterative Grounding , 2024. URL http://arxiv.org/abs/2401.01701

  14. [14]

    Using digital traces to analyze software work: skills, careers and programming languages

    Xiangnan Feng, Johannes Wachs, Simone Daniotti, and Frank Neffke. The building blocks of software work explain coding careers and language popularity, 2025. URL http://arxiv.org/abs/2504.03581

  15. [15]

    10 Little-Known Python Libraries That Will Make You Feel Like a Data Wizard , 2025

    Josep Ferrer. 10 Little-Known Python Libraries That Will Make You Feel Like a Data Wizard , 2025. URL https://www.kdnuggets.com/10-little-known-python-libraries-that-will-make-you-feel-like-a-data-wizard

  16. [16]

    Reasoning Robustness of LLMs to Adversarial Typographical Errors

    Esther Gan, Yiran Zhao, Liying Cheng, Mao Yancan, Anirudh Goyal, Kenji Kawaguchi, Min-Yen Kan, and Michael Shieh. Reasoning Robustness of LLMs to Adversarial Typographical Errors . In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pp.\ 10449--10459. Associa...

  17. [17]

    Research: Quantifying GitHub Copilot ’s impact in the enterprise with Accenture , 2024

    Ya Gao and GitHub Customer Research. Research: Quantifying GitHub Copilot ’s impact in the enterprise with Accenture , 2024. URL https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/

  18. [18]

    Auditing Prompt Caching in Language Model APIs , February 2025

    Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, and Tatsunori Hashimoto. Auditing Prompt Caching in Language Model APIs , February 2025

  19. [19]

    GitHub Typo Corpus : A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors , 2019

    Masato Hagiwara and Masato Mita. GitHub Typo Corpus : A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors , 2019. URL http://arxiv.org/abs/1911.12893

  20. [20]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A Survey on Hallucination in Large Language Models : Principles , Taxonomy , Challenges , and Open Questions , 2023. URL http://arxiv.org/abs/2311.05232

  21. [21]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5- Coder Technical Report , 2024. URL http://arxiv.org/...

  22. [22]

    On Mitigating Code LLM Hallucinations with API Documentation , 2024

    Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, and Varun Kumar. On Mitigating Code LLM Hallucinations with API Documentation , 2024. URL http://arxiv.org/abs/2407.09726

  23. [23]

    Survey of Hallucination in Natural Language Generation

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of Hallucination in Natural Language Generation . 55 0 (12): 0 248:1--248:38, 2023. ISSN 0360-0300. doi:10.1145/3571730. URL https://doi.org/10.1145/3571730

  24. [24]

    A Survey on Large Language Models for Code Generation

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A Survey on Large Language Models for Code Generation , 2024 a . URL http://arxiv.org/abs/2406.00515

  25. [25]

    A survey on large language model hallucination via a creativity perspective

    Xuhui Jiang, Yuxing Tian, Fengrui Hua, Chengjin Xu, Yuanzhuo Wang, and Jian Guo. A Survey on Large Language Model Hallucination via a Creativity Perspective , 2024 b . URL http://arxiv.org/abs/2402.06647

  26. [26]

    Importing Phantoms : Measuring LLM Package Hallucination Vulnerabilities , 2025

    Arjun Krishna, Erick Galinkin, Leon Derczynski, and Jeffrey Martin. Importing Phantoms : Measuring LLM Package Hallucination Vulnerabilities , 2025. URL http://arxiv.org/abs/2501.19012

  27. [27]

    Selecting third-party libraries: The practitioners’ perspective

    Enrique Larios Vargas, Maurício Aniche, Christoph Treude, Magiel Bruntink, and Georgios Gousios. Selecting third-party libraries: The practitioners’ perspective. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering , ESEC / FSE 2020, pp.\ 245--256. Association for...

  28. [28]

    Is ChatGPT a Good Software Librarian ? An Exploratory Study on the Use of ChatGPT for Software Library Recommendations , 2024

    Jasmine Latendresse, SayedHassan Khatoonabadi, Ahmad Abdellatif, and Emad Shihab. Is ChatGPT a Good Software Librarian ? An Exploratory Study on the Use of ChatGPT for Software Library Recommendations , 2024. URL http://arxiv.org/abs/2408.05128

  29. [29]

    Hallucination by Code Generation LLMs : Taxonomy , Benchmarks , Mitigation , and Challenges , 2025

    Yunseo Lee, John Youngeun Song, Dongsun Kim, Jindae Kim, Mijung Kim, and Jaechang Nam. Hallucination by Code Generation LLMs : Taxonomy , Benchmarks , Mitigation , and Challenges , 2025. URL http://arxiv.org/abs/2504.20799

  30. [30]

    Exploring and evaluating hallucinations in llm-powered code generation.arXiv preprint arXiv:2404.00971, 2024

    Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and Evaluating Hallucinations in LLM-Powered Code Generation , 2024. URL https://arxiv.org/abs/2404.00971v2

  31. [31]

    Self- Reflection Makes Large Language Models Safer , Less Biased , and Ideologically Neutral , 2025

    Fengyuan Liu, Nouar AlDahoul, Gregory Eady, Yasir Zaki, and Talal Rahwan. Self- Reflection Makes Large Language Models Safer , Less Biased , and Ideologically Neutral , 2025. URL http://arxiv.org/abs/2406.10400

  32. [32]

    In: Advances in Neural Information Processing Systems, Curran Associates, Inc., vol 34, pp 27,865–27,876,https://proceedings

    Mingwei Liu, Tianyong Yang, Yiling Lou, Xueying Du, Ying Wang, and Xin Peng. CodeGen4Libs : A Two-Stage Approach for Library-Oriented Code Generation . In 2023 38th IEEE / ACM International Conference on Automated Software Engineering ( ASE ) , pp.\ 434--445. IEEE, 2023. ISBN 979-8-3503-2996-4. doi:10.1109/ASE56229.2023.00159. URL https://ieeexplore.ieee....

  33. [33]

    Llama 3.3 | Model Cards and Prompt formats, 2025

    Meta. Llama 3.3 | Model Cards and Prompt formats, 2025. URL https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3/

  34. [34]

    Un Ministral , des Ministraux | Mistral AI , 2025

    MistralAI. Un Ministral , des Ministraux | Mistral AI , 2025. URL https://mistral.ai/news/ministraux

  35. [35]

    A Closer Look at System Prompt Robustness , 2025

    Norman Mu, Jonathan Lu, Michael Lavery, and David Wagner. A Closer Look at System Prompt Robustness , 2025. URL http://arxiv.org/abs/2502.12197

  36. [36]

    The Dynamics of Innovation in Open Source Software Ecosystems , 2024

    Gábor Mészáros and Johannes Wachs. The Dynamics of Innovation in Open Source Software Ecosystems , 2024. URL http://arxiv.org/abs/2411.14894

  37. [37]

    Beyond typosquatting: An in-depth look at package confusion

    Shradha Neupane, Grant Holmes, Elizabeth Wyss, Drew Davidson, and Lorenzo De Carli. Beyond typosquatting: An in-depth look at package confusion. In Proceedings of the 32nd USENIX Conference on Security Symposium , SEC '23, pp.\ 3439--3456. USENIX Association, 2023. ISBN 978-1-939133-37-3

  38. [38]

    Satya Nadella says as much as 30\ URL https://www.nbclosangeles.com/news/business/money-report/satya-nadella-says-as-much-as-30-of-microsoft-code-is-written-by-ai/3689617/

    Jordan Novet and Jonathan Vanian. Satya Nadella says as much as 30\ URL https://www.nbclosangeles.com/news/business/money-report/satya-nadella-says-as-much-as-30-of-microsoft-code-is-written-by-ai/3689617/

  39. [39]

    GPT-4o mini - API , 2025 a

    OpenAI. GPT-4o mini - API , 2025 a . URL https://platform.openai.com/docs/models/gpt-4o-mini

  40. [40]

    GPT-5 mini - API , 2025 b

    OpenAI. GPT-5 mini - API , 2025 b . URL https://platform.openai.com

  41. [41]

    Slopsquatting: Hallucination in Coding Agents and Vibe Coding , 2025

    Sean Park. Slopsquatting: Hallucination in Coding Agents and Vibe Coding , 2025. URL https://www.trendmicro.com/vinfo/gb/security/news/cybercrime-and-digital-threats/slopsquatting-when-ai-agents-hallucinate-malicious-packages

  42. [42]

    Gorilla: Large Language Model Connected with Massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large Language Model Connected with Massive APIs , 2023. URL http://arxiv.org/abs/2305.15334

  43. [43]

    Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback

    Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. Check Your Facts and Try Again : Improving Large Language Models with External Knowledge and Automated Feedback , 2023. URL http://arxiv.org/abs/2302.12813

  44. [44]

    Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, and Jared Kaplan

    Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Ke...

  45. [45]

    Ast — Abstract Syntax Trees , 2025

    Python Software Foundation PSF. Ast — Abstract Syntax Trees , 2025. URL https://docs.python.org/3/library/ast.html

  46. [46]

    Names and normalization - Python Packaging User Guide , 2025

    PyPA. Names and normalization - Python Packaging User Guide , 2025. URL https://packaging.python.org/en/latest/specifications/name-normalization/

  47. [47]

    PyPI · The Python Package Index , 2025

    PyPI. PyPI · The Python Package Index , 2025. URL https://pypi.org/

  48. [48]

    The role of library versions in Developer-ChatGPT conversations, 2024

    Rachna Raj and Diego Elias Costa. The role of library versions in Developer-ChatGPT conversations, 2024. URL http://arxiv.org/abs/2401.16340

  49. [49]

    doi: 10.18653/v1/D19-1410

    Nils Reimers and Iryna Gurevych. Sentence- BERT : Sentence Embeddings using Siamese BERT-Networks . In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing ( EMNLP-IJCNLP ) , pp.\ 3982--3...

  50. [50]

    Large language models reduce public knowledge sharing on online Q & A platforms

    R Maria del Rio-Chanona, Nadzeya Laurentsyeva, and Johannes Wachs. Large language models reduce public knowledge sharing on online Q & A platforms. 3 0 (9), 2024. doi:10.1093/pnasnexus/pgae400. URL https://dx.doi.org/10.1093/pnasnexus/pgae400

  51. [51]

    Large language model for vulnerability detection: Emerging results and future directions,

    June Sallou, Thomas Durieux, and Annibale Panichella. Breaking the Silence : The Threats of Using LLMs in Software Engineering . In Proceedings of the 2024 ACM / IEEE 44th International Conference on Software Engineering : New Ideas and Emerging Results , ICSE-NIER '24, pp.\ 102--106. Association for Computing Machinery, 2024. ISBN 979-8-4007-0500-7. doi:...

  52. [52]

    E. G. Santana Jr, Gabriel Benjamin, Melissa Araujo, Harrison Santos, David Freitas, Eduardo Almeida, Paulo Anselmo da M. S. Neto, Jiawei Li, Jina Chun, and Iftekhar Ahmed. Which Prompting Technique Should I Use ? An Empirical Investigation of Prompting Techniques for Software Engineering Tasks , 2025. URL http://arxiv.org/abs/2506.05614

  53. [53]

    AgglomerativeClustering , 2025 a

    scikit learn. AgglomerativeClustering , 2025 a . URL https://scikit-learn/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

  54. [54]

    CountVectorizer , 2025 b

    scikit learn. CountVectorizer , 2025 b . URL https://scikit-learn/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

  55. [55]

    Towards Understanding Sycophancy in Language Models

    Mrinank Sharma, Meg Tong, Tomek Korbak, David Duvenaud, Amanda Askell, Sam Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards Understanding Sycophancy in Language Models . 2024: 0 110--144, 2024. URL https:...

  56. [56]

    Software Engineering , Global Edition

    Ian Somerville. Software Engineering , Global Edition . Pearson Education, 2016. ISBN 978-1-292-09614-8

  57. [57]

    Misspellings in Natural Language Processing : A survey, 2025

    Gianluca Sperduti and Alejandro Moreo. Misspellings in Natural Language Processing : A survey, 2025. URL http://arxiv.org/abs/2501.16836

  58. [58]

    Joseph Spracklen, Raveen Wijewickrama, A. H. M. Nazmus Sakib, Anindya Maiti, Bimal Viswanath, and Murtuza Jadliwala. We Have a Package for You ! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs , 2024. URL http://arxiv.org/abs/2406.10279

  59. [59]

    Confabulation: The Surprising Value of Large Language Model Hallucinations , 2024

    Peiqi Sui, Eamon Duede, Sophie Wu, and Richard Jean So. Confabulation: The Surprising Value of Large Language Model Hallucinations , 2024. URL http://arxiv.org/abs/2406.04175

  60. [60]

    Adv- BERT : BERT is not robust on misspellings! Generating nature adversarial samples on BERT , 2020

    Lichao Sun, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. Adv- BERT : BERT is not robust on misspellings! Generating nature adversarial samples on BERT , 2020. URL http://arxiv.org/abs/2003.04985

  61. [61]

    How do people decide?

    Minaoar Hossain Tanzil, Gias Uddin, and Ann Barcomb. " How do people decide?": A Model for Software Library Selection . In Proceedings of the 2024 IEEE / ACM 17th International Conference on Cooperative and Human Aspects of Software Engineering , pp.\ 1--12, 2024. doi:10.1145/3641822.3641865. URL http://arxiv.org/abs/2403.16245

  62. [62]

    Vaidya, Drew Davidson, Lorenzo De Carli, and Vaibhav Rastogi

    Matthew Taylor, Ruturaj K. Vaidya, Drew Davidson, Lorenzo De Carli, and Vaibhav Rastogi. SpellBound : Defending Against Package Typosquatting , 2020. URL http://arxiv.org/abs/2003.03471

  63. [63]

    CodeHalu : Investigating Code Hallucinations in LLMs via Execution-based Verification , 2024

    Yuchen Tian, Weixiang Yan, Qian Yang, Xuandong Zhao, Qian Chen, Wen Wang, Ziyang Luo, Lei Ma, and Dawn Song. CodeHalu : Investigating Code Hallucinations in LLMs via Execution-based Verification , 2024. URL https://arxiv.org/abs/2405.00253v3

  64. [64]

    A Study of LLMs' Preferences for Libraries and Programming Languages

    Lukas Twist, Jie M. Zhang, Mark Harman, Don Syme, Joost Noppen, Helen Yannakoudakis, and Detlef Nauck. A Study of LLMs ' Preferences for Libraries and Programming Languages , 2025. URL http://arxiv.org/abs/2503.17181

  65. [65]

    Mind Your Format : Towards Consistent Evaluation of In-Context Learning Improvements

    Anton Voronov, Lena Wolf, and Max Ryabinin. Mind Your Format : Towards Consistent Evaluation of In-Context Learning Improvements . In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics : ACL 2024 , pp.\ 6287--6310. Association for Computational Linguistics, 2024. doi:10.18653/v1/2024.findings-ac...

  66. [66]

    Chaozheng Wang, Shuzheng Gao, Cuiyun Gao, Wenxuan Wang, Chun Yong Chong, Shan Gao, and Michael R. Lyu. A Systematic Evaluation of Large Code Models in API Suggestion : When , Which , and How . In Proceedings of the 39th \ \ IEEE / ACM \ \ International Conference on Automated Software Engineering , \ \ ASE \ \ 2024, Sacramento , CA , USA , October 27 - No...

  67. [67]

    LLMs Meet Library Evolution : Evaluating Deprecated API Usage in LLM-based Code Completion

    Chong Wang, Kaifeng Huang, Jian Zhang, Yebo Feng, Lyuye Zhang, Yang Liu, and Xin Peng. LLMs Meet Library Evolution : Evaluating Deprecated API Usage in LLM-based Code Completion . In Proceedings of 47th International Conference on Software Engineering ( ICSE 2025) . arXiv, 2025. doi:10.48550/arXiv.2406.09834. URL http://arxiv.org/abs/2406.09834

  68. [68]

    ExploraCoder : Advancing code generation for multiple unseen APIs via planning and chained exploration, 2024 b

    Yunkun Wang, Yue Zhang, Zhen Qin, Chen Zhi, Binhua Li, Fei Huang, Yongbin Li, and Shuiguang Deng. ExploraCoder : Advancing code generation for multiple unseen APIs via planning and chained exploration, 2024 b . URL http://arxiv.org/abs/2412.05366

  69. [69]

    Execution-Based Evaluation for Open-Domain Code Generation

    Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-Based Evaluation for Open-Domain Code Generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023. arXiv, May 2023. doi:10.48550/arXiv.2212.10481

  70. [70]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, pp. 24824-24837. Curran Associates Inc., 2022. ISBN 978-1-7138-7108-8

  71. [71]

    DevGPT: Studying Developer-ChatGPT Conversations

    Tao Xiao, Christoph Treude, Hideaki Hata, and Kenichi Matsumoto. DevGPT: Studying Developer-ChatGPT Conversations. In Proceedings of the 21st International Conference on Mining Software Repositories, pp. 227-230, 2024. doi:10.1145/3643991.3648400. URL http://arxiv.org/abs/2309.03914

  72. [72]

    CERT: Continual Pre-training on Sketches for Library-oriented Code Generation

    Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. CERT: Continual Pre-training on Sketches for Library-oriented Code Generation. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pp. 2369-2375. International Joint Conferences on Artificial Inte...

  73. [73]

    Private-Library-Oriented Code Generation with Large Language Models

    Daoguang Zan, Bei Chen, Yongshun Gong, Junzhi Cao, Fengji Zhang, Bingchao Wu, Bei Guan, Yilong Yin, and Yongji Wang. Private-Library-Oriented Code Generation with Large Language Models, 2023. URL http://arxiv.org/abs/2307.15370

  74. [74]

    Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. 2023. doi:10.48550/ARXIV.2309.01219. URL https://arxiv.org/abs/2309.01219

  75. [75]

    LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation

    Ziyao Zhang, Yanlin Wang, Chong Wang, Jiachi Chen, and Zibin Zheng. LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation, 2024. URL https://arxiv.org/abs/2409.20550v1

  76. [76]

    Retrieval-Augmented Generation for AI-Generated Content: A Survey

    Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-Augmented Generation for AI-Generated Content: A Survey, 2024. URL http://arxiv.org/abs/2402.19473

  77. [77]

    Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models

    Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou. Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models. In 14th International Conference on Learning Representations (ICLR24). arXiv, 2024. doi:10.48550/arXiv.2310.06117. URL http://arxiv.org/abs/2310.06117

  78. [78]

    Can LLM replace stack overflow? a study on robustness and reliability of large language model code generation

    Li Zhong and Zilong Wang. Can LLM replace stack overflow? a study on robustness and reliability of large language model code generation. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artific...

  79. [79]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Davi...

  80. [80]

    Identifying and Mitigating API Misuse in Large Language Models

    Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. Identifying and Mitigating API Misuse in Large Language Models, 2025. URL http://arxiv.org/abs/2503.22821
