GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.SE (2)
Years: 2025 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%, and introduces LibHalluBench for evaluation.
- Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
Clotho ranks LLM test inputs by failure likelihood using pre-generation hidden states and GMMs, achieving 0.716 ROC-AUC after labeling 5.4% of inputs on average across eight tasks and three models, with transfer to proprietary models.