Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
Pith reviewed 2026-05-21 22:27 UTC · model grok-4.3
The pith
Small prompt changes like one-character misspellings cause LLMs to invent non-existent libraries in up to 26% of code tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is that library hallucinations in LLM-generated code are highly sensitive to variations in the user's prompt. One-character misspellings trigger hallucinations in up to 26% of tasks, fabricated library names are accepted in up to 99% of cases, and time-based prompts induce hallucinations in up to 85%. The study covers seven diverse LLMs and two hallucination types: library name hallucinations (invalid imports) and library member hallucinations (invalid calls from valid libraries). The highest-risk prompts identified in the study form the basis of LibHalluBench, a benchmark for reproducible evaluation of such hallucinations.
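To make the two hallucination types concrete, here is a minimal Python sketch of how they could be detected in generated code. It is not the paper's detection pipeline. It assumes Python is the target language and treats "not importable in the local environment" as a stand-in for "does not exist", whereas a real check would more likely consult a package index. The function names (`top_level_imports`, `name_hallucinations`, `member_hallucinations`) are hypothetical.

```python
import ast
import importlib
import importlib.util


def top_level_imports(code: str) -> set[str]:
    """Return the top-level module names imported by the code (parsed, not executed)."""
    names = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def name_hallucinations(code: str) -> set[str]:
    """Imports that cannot be resolved locally (assumption: a proxy for 'invalid import')."""
    return {n for n in top_level_imports(code) if importlib.util.find_spec(n) is None}


def member_hallucinations(code: str) -> set[str]:
    """Calls of the form `alias.member(...)` whose member is missing from a resolvable module."""
    tree = ast.parse(code)
    aliases = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if "." not in alias.name:  # keep the sketch to plain `import x [as y]`
                    aliases[alias.asname or alias.name] = alias.name
    missing = set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and isinstance(node.func.value, ast.Name)
                and node.func.value.id in aliases):
            module_name = aliases[node.func.value.id]
            if importlib.util.find_spec(module_name) is None:
                continue  # already counted as a name hallucination
            # Note: this imports the real module, which runs its top-level code.
            module = importlib.import_module(module_name)
            if not hasattr(module, node.func.attr):
                missing.add(f"{module_name}.{node.func.attr}")
    return missing


# Illustrative input: one made-up library and one made-up member of a real library.
sample = (
    "import json\n"
    "import jsonn_fake_lib_xyz\n"
    "json.loads('{}')\n"
    "json.parse_fast('{}')\n"
)
print(name_hallucinations(sample))
print(member_hallucinations(sample))
```

The sketch only resolves direct `alias.member(...)` calls, so members reached through `from x import y` or nested attributes would need additional handling.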
What carries the argument
Controlled variations in developer prompts (one-character misspellings, fabricated library and member names, and time-based phrasing), used to measure the rates of invalid imports and invalid function calls in generated code.
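As an illustration of the misspelling condition, the sketch below enumerates strings one character edit away from a library name and places a few of them into a prompt. The choice of edit operations (insert, delete, substitute over lowercase letters), the prompt template, and the helper names are all assumptions made for this example; the summary above does not specify the paper's exact procedure.

```python
import random
import string


def one_char_misspellings(name: str) -> set[str]:
    """All strings one insertion, deletion, or substitution away from `name`."""
    letters = string.ascii_lowercase
    variants = set()
    for i in range(len(name) + 1):
        for c in letters:
            variants.add(name[:i] + c + name[i:])          # insertion
        if i < len(name):
            variants.add(name[:i] + name[i + 1:])          # deletion
            for c in letters:
                variants.add(name[:i] + c + name[i + 1:])  # substitution
    variants.discard(name)  # substituting a letter with itself is not a misspelling
    return variants


def build_prompt(task: str, library: str) -> str:
    """Hypothetical prompt template, not taken from the paper."""
    return f"{task} Use the {library} library."


def misspelled_prompts(task: str, library: str, k: int, seed: int = 0) -> list[str]:
    """Sample k misspelled variants reproducibly and wrap each in the template."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(one_char_misspellings(library)), k)
    return [build_prompt(task, variant) for variant in chosen]


for prompt in misspelled_prompts("Load a CSV file into a table.", "pandas", k=3):
    print(prompt)
```

A matched control prompt is simply `build_prompt(task, library)` with the correct spelling, which is the kind of baseline the referee report asks for.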
Load-bearing premise
The specific prompt variations and the seven LLMs tested are representative of how developers actually query code generation tools and make mistakes.
What would settle it
Running the same set of prompt variations on additional LLMs not included in the study or on real-world logs of developer interactions with code assistants, and observing whether the hallucination rates remain consistent.
Original abstract
Large language models (LLMs) now play a central role in code generation, yet they continue to hallucinate, frequently inventing non-existent libraries. Such library hallucinations are not just benign errors: they can mislead developers, break builds, and expose systems to supply chain threats such as slopsquatting. Despite growing awareness of these risks, there is limited understanding of how library hallucinations manifest under realistic usage conditions. To fill this gap, we present the first systematic study of how user-level prompt variations influence library hallucinations in LLM-generated code. Across seven diverse LLMs, we analyse library name hallucinations (invalid imports) and library member hallucinations (invalid calls from valid libraries), examining the effects of realistic developer language and controlled user mistakes, including misspellings and fabricated libraries or members. Our findings expose systemic vulnerabilities: one-character misspellings trigger hallucinations in up to 26% of tasks; fabricated library names are accepted in up to 99%; and time-based prompts induce hallucinations in up to 85%. Grounded in the highest-risk prompts identified in our study, we introduce LibHalluBench, a benchmark that enables a systematic and reproducible evaluation of these library hallucinations. Our findings underscore the fragility of LLMs to natural prompt variation and highlight the urgent need for safeguards against library-related hallucinations and their downstream risks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first systematic empirical study of library hallucinations (invalid imports and invalid member calls) in code generated by seven LLMs. It examines the influence of realistic developer prompt variations, including one-character misspellings, fabricated library names, and time-based prompts, reports quantitative hallucination rates (up to 26%, 99%, and 85% respectively), and introduces LibHalluBench as a benchmark derived from the highest-risk prompts identified.
Significance. If the central measurements are shown to be robust, the work is significant because it quantifies how common, low-effort prompt variations can produce high rates of library hallucinations with downstream security implications (e.g., slopsquatting). The introduction of LibHalluBench is a constructive contribution that could support reproducible follow-on evaluation in the LLM-for-code literature.
major comments (1)
- [Section 3 and Section 4] Section 3 (Methodology) and Section 4 (Results): the central claims attribute specific hallucination rates to the tested prompt variations (e.g., 'one-character misspellings trigger hallucinations in up to 26% of tasks'). However, the reported experiments do not include or reference paired baseline measurements on matched prompts that use correct spellings and valid library names. Without these controls it is not possible to determine whether the observed rates exceed the models' baseline hallucination propensity on the chosen tasks, weakening the causal language of 'trigger' and 'influence'.
minor comments (2)
- [Abstract] Abstract: the phrase 'up to 26%' (and similar maxima) should be accompanied by the specific model, task count, and prompt template that produced the maximum so readers can assess the scope of the reported effect.
- [Section 3] The manuscript would benefit from an explicit statement of the total number of prompts, tasks, and generations per condition, together with any statistical tests or confidence intervals used to support the reported percentages.
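One way to attach the requested uncertainty to a reported percentage is a Wilson score interval for a binomial proportion. The sketch below is a generic illustration of that standard interval, not a statement of what the authors computed, and the counts in the usage line are made up for the example.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: 26 hallucinated generations out of 100 tasks.
low, high = wilson_interval(26, 100)
print(f"rate = 0.26, 95% interval = ({low:.3f}, {high:.3f})")
```

The interval narrows as the number of tasks grows, which is why the total number of prompts and generations per condition matters for interpreting the "up to" maxima.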
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the work's significance and for this constructive comment on experimental controls. We address the point below and have revised the manuscript to strengthen the presentation of results.
Point-by-point responses
-
Referee: [Section 3 and Section 4] Section 3 (Methodology) and Section 4 (Results): the central claims attribute specific hallucination rates to the tested prompt variations (e.g., 'one-character misspellings trigger hallucinations in up to 26% of tasks'). However, the reported experiments do not include or reference paired baseline measurements on matched prompts that use correct spellings and valid library names. Without these controls it is not possible to determine whether the observed rates exceed the models' baseline hallucination propensity on the chosen tasks, weakening the causal language of 'trigger' and 'influence'.
Authors: We agree that explicit paired baselines on matched prompts with correct spellings and valid library names would allow clearer isolation of the incremental effect of the variations. Our original design emphasized realistic developer prompt conditions rather than exhaustive controls, but we acknowledge this limits strong causal attribution. In the revised manuscript we have added these baseline conditions to Section 3 and report comparative hallucination rates in Section 4, showing that rates under the varied prompts are substantially higher than the matched correct-prompt baselines. We have updated the abstract, results, and discussion to frame the findings in terms of relative increases rather than absolute triggering, while retaining the quantitative rates observed under each condition.
Revision: yes
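The paired comparison discussed in this exchange can be sketched as follows. Each task contributes one outcome under the correct prompt and one under the varied prompt, and the sketch reports both rates, their difference, and an exact McNemar test on the discordant pairs. The choice of McNemar's test, the function name `paired_comparison`, and the data are assumptions for illustration; the rebuttal does not say which test, if any, the authors used.

```python
from math import comb


def paired_comparison(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """pairs: (baseline_hallucinated, variant_hallucinated) for each matched task."""
    if not pairs:
        raise ValueError("at least one pair is required")
    n = len(pairs)
    baseline_rate = sum(b for b, _ in pairs) / n
    variant_rate = sum(v for _, v in pairs) / n
    only_variant = sum(1 for b, v in pairs if v and not b)
    only_baseline = sum(1 for b, v in pairs if b and not v)
    discordant = only_variant + only_baseline
    if discordant == 0:
        p_value = 1.0
    else:
        # Exact two-sided McNemar test: under the null hypothesis,
        # discordant pairs fall on either side with probability 1/2.
        k = min(only_variant, only_baseline)
        tail = sum(comb(discordant, i) for i in range(k + 1)) / 2**discordant
        p_value = min(1.0, 2 * tail)
    return {
        "baseline_rate": baseline_rate,
        "variant_rate": variant_rate,
        "absolute_increase": variant_rate - baseline_rate,
        "p_value": p_value,
    }


# Made-up outcomes for 20 matched tasks.
example = (
    [(False, True)] * 8      # hallucination only under the varied prompt
    + [(False, False)] * 10  # no hallucination in either condition
    + [(True, True)] * 1     # hallucination in both conditions
    + [(True, False)] * 1    # hallucination only under the correct prompt
)
print(paired_comparison(example))
```

Reporting the absolute increase alongside the baseline rate supports the "relative increase" framing without implying that every hallucination under a varied prompt was caused by the variation.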
Circularity Check
No circularity: empirical measurement study with direct experimental outcomes
Full rationale
This paper conducts an empirical study testing library hallucinations across seven LLMs under controlled prompt variations including misspellings, fabricated names, and time-based prompts. The central claims report observed hallucination rates (e.g., up to 26% for one-character misspellings) as direct results from the experiments rather than any mathematical derivation, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or first-principles chains are present that reduce outputs to inputs by construction; LibHalluBench is introduced as a benchmark grounded in the highest-risk prompts identified experimentally. The analysis is self-contained against external benchmarks with no reduction to prior author work or ansatz smuggling.
Forward citations
Cited by 1 Pith paper
-
Hallucination Inspector: A Fact-Checking Judge for API Migration
Hallucination Inspector verifies symbols in LLM-generated API migration code against a documentation-derived knowledge base using AST extraction, identifying scaffolding hallucinations and cutting false positives vers...