The LibEvoBench benchmark shows that LLMs are version-oblivious on evolving APIs: supplying documentation helps, but specifying the version does not.
Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware
5 Pith papers cite this work.
representative citing papers
PowerCodeBench and a boundary-aware intervention raise LLM accuracy on power-system code generation by 32-56 points across ten open-weight models and four commercial APIs on a 2,000-task benchmark.
Stale repository context in code RAG actively induces models to produce obsolete helper references, raising stale outputs by 76-88 percentage points over current-only retrieval in a 17-sample diagnostic study.
A study of seven LLMs finds that realistic prompt variations trigger library hallucinations: one-character misspellings in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%. The study also introduces LibHalluBench for evaluation.
Static analysis tools detect 14-85% of library hallucinations in LLM code but are limited to at most 48.5-77% coverage even in ideal cases.
citing papers explorer
- An Empirical Analysis of Static Analysis Methods for Detection and Mitigation of Code Library Hallucinations