Hallucination Inspector: A Fact-Checking Judge for API Migration
Pith reviewed 2026-05-10 00:47 UTC · model grok-4.3
The pith
Hallucination Inspector detects invented symbols in LLM-generated API migration code using static analysis against documentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Scaffolding hallucination occurs when LLMs generate incorrect calling contexts for new APIs by inventing phantom symbols such as imaginary imports, constructors, and constants that do not exist in the API specification. Standard metrics cannot be relied upon to detect these instances. Hallucination Inspector is a static analysis tool that extracts symbols from the abstract syntax tree of the generated code and checks them against a knowledge base derived directly from the software documentation. In a preliminary evaluation on Android API migrations, the approach successfully identifies hallucinations and significantly reduces false positives compared to standard metrics and probabilistic judges.
What carries the argument
Hallucination Inspector, a static analysis framework that extracts symbols from the abstract syntax tree of generated code and validates each against a knowledge base built directly from API documentation to flag phantom symbols.
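The extract-then-validate loop described above can be sketched in a few lines. The paper targets Android (Java/Kotlin) code, so this is a hypothetical Python analogue built on the standard `ast` module; every function name, symbol, and knowledge-base entry below is invented for illustration and does not come from the paper.

```python
# Minimal sketch of the symbol-extraction-and-check idea, assuming a Python
# analogue of the paper's Java/Kotlin setting. All names are hypothetical.
import ast

def extract_symbols(source: str) -> set[str]:
    """Collect imported names and attribute accesses from the code's AST."""
    symbols = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            symbols.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            symbols.update(f"{node.module}.{alias.name}" for alias in node.names)
        elif isinstance(node, ast.Attribute):
            symbols.add(node.attr)
    return symbols

def phantom_symbols(source: str, knowledge_base: set[str]) -> set[str]:
    """Symbols the generated code uses but the documentation-derived KB lacks."""
    return extract_symbols(source) - knowledge_base

# Toy knowledge base standing in for one derived from API documentation.
kb = {"newapi.connect", "close"}
snippet = "from newapi import connect, magic_token\nconn = connect()\nconn.close()"
print(sorted(phantom_symbols(snippet, kb)))  # ['newapi.magic_token']
```

Here the invented import `magic_token` is flagged as a phantom symbol, while the documented `connect` and `close` pass; the real tool would additionally resolve qualified names and constructors rather than bare attribute strings.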
Load-bearing premise
The knowledge base built from software documentation is complete and accurate for all valid symbols, and the static AST extraction correctly identifies every relevant symbol, neither missing phantom symbols nor flagging valid code as hallucinated.
What would settle it
A counterexample would be an LLM-generated migration snippet containing a phantom symbol that the inspector fails to flag as invalid, or a correct code snippet that the inspector incorrectly marks as containing a hallucination.
Figures
Original abstract
Large Language Models (LLMs) are increasingly deployed in automated software engineering for tasks such as API migration. While LLMs are able to identify migration patterns, they often make mistakes and fail to produce correct glue code to invoke the new API in place of the old one. We call this issue Scaffolding Hallucination, a failure mode where models generate incorrect calling contexts by inventing Phantom Symbols -- such as imaginary imports, constructors, and constants -- that do not exist in the API specification. In this paper, we show that standard metrics cannot be relied upon to detect these instances of hallucination. We propose Hallucination Inspector, a static analysis tool to detect Scaffolding Hallucination in LLM-generated code. Our approach includes a lightweight evaluation framework that verifies symbols extracted from the abstract syntax tree against a knowledge base derived directly from software documentation for the API. A preliminary evaluation on Android API migrations demonstrates that our approach successfully identifies hallucinations and significantly reduces false positives compared to standard metrics and probabilistic judges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper defines Scaffolding Hallucination as a failure mode in LLM-generated API migration code where phantom symbols (imaginary imports, constructors, constants) are invented that do not exist in the API specification. It proposes Hallucination Inspector, a static analysis tool that extracts symbols via AST and verifies them against a knowledge base built directly from software documentation. A preliminary evaluation on Android API migrations is claimed to show that the tool successfully identifies hallucinations and significantly reduces false positives relative to standard metrics and probabilistic judges.
Significance. If the evaluation results hold, the work provides a lightweight, documentation-grounded static checker that targets a concrete hallucination pattern in automated API migration. This could improve reliability of LLM-based code generation tools without requiring additional model training. The direct derivation of the knowledge base from documentation is a clear methodological strength that avoids circular reliance on the same LLM.
major comments (2)
- [Abstract / Preliminary Evaluation] The central claim in the abstract that the approach 'significantly reduces false positives' is not supported by any quantitative evidence. No false-positive rates, precision/recall figures, comparison tables, or baseline numbers are supplied, so the headline result cannot be assessed against the paper's own data.
- [Approach / Knowledge Base Construction] The validity of the false-positive reduction claim rests on the knowledge base being complete for all valid symbols that appear in correct migrations. No coverage audit, cross-check against the paper's own migration examples, or handling of version-specific or undocumented symbols is described; incomplete coverage would cause legitimate code to be flagged, directly undermining the reported improvement over baselines.
minor comments (2)
- [Abstract] The abstract introduces 'Phantom Symbols' and 'Scaffolding Hallucination' without a concise example or formal definition; adding one would improve immediate clarity for readers.
- [Preliminary Evaluation] The paper refers to 'standard metrics and probabilistic judges' without naming the specific baselines used in the Android evaluation; explicit citation or description of these comparators is needed for reproducibility.
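To make the first minor comment concrete, a phantom symbol might look like the following. The symbol names here are purely illustrative and are not taken from the paper's evaluation or from a verified Android API listing.

```python
# Hypothetical illustration of a Phantom Symbol: the model's migration refers
# to a class the target API's documentation never defines.
documented_symbols = {            # stand-in for a documentation-derived KB
    "NotificationManager",
    "NotificationChannel",        # illustrative names only, not a verified API list
}
generated_symbols = {             # symbols appearing in the generated migration
    "NotificationManager",
    "NotificationChannelBuilder", # invented by the model: a phantom symbol
}
phantoms = generated_symbols - documented_symbols
print(sorted(phantoms))  # ['NotificationChannelBuilder']
```

The set difference is the whole trick: anything the generated code names that the documentation does not is, by definition, a candidate hallucination.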
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our paper. We address the major comments point by point below, agreeing to make revisions where the evaluation claims require stronger support and additional methodological details.
Point-by-point responses
- Referee: [Abstract / Preliminary Evaluation] The central claim in the abstract that the approach 'significantly reduces false positives' is not supported by any quantitative evidence. No false-positive rates, precision/recall figures, comparison tables, or baseline numbers are supplied, so the headline result cannot be assessed against the paper's own data.
Authors: We agree that the abstract's claim would benefit from quantitative backing to allow proper assessment. Although the preliminary evaluation section discusses comparisons to standard metrics and probabilistic judges with some illustrative examples, explicit numerical results such as false positive rates are not detailed in the current draft. In the revised manuscript, we will include a table with precision, recall, and false positive reduction metrics from our Android API migration tests, and update the abstract to reference these findings more precisely. revision: yes
- Referee: [Approach / Knowledge Base Construction] The validity of the false-positive reduction claim rests on the knowledge base being complete for all valid symbols that appear in correct migrations. No coverage audit, cross-check against the paper's own migration examples, or handling of version-specific or undocumented symbols is described; incomplete coverage would cause legitimate code to be flagged, directly undermining the reported improvement over baselines.
Authors: The knowledge base is derived directly from the official Android documentation by extracting classes, methods, and constants. We did perform internal checks to ensure coverage for the symbols in our test migrations, but these were not documented in the paper. We will add a new subsection in the Approach section detailing the KB construction process, including a coverage analysis for the evaluated APIs, how version-specific symbols are handled by selecting the documentation for the target API version, and acknowledgment of potential gaps for undocumented or deprecated symbols. This will strengthen the justification for the false-positive reduction. revision: yes
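The version-handling strategy the authors describe, selecting the documentation for the target API version, can be sketched as follows. The doc format and all class and member names here are hypothetical; the paper does not specify how its documentation is structured.

```python
# Sketch of knowledge-base construction from per-version documentation,
# assuming docs are available as structured entries. Format is hypothetical.
DOCS = {
    "33": {  # target API version -> class -> documented members
        "app.AlarmScheduler": ["schedule", "cancel"],
        "app.Clock": ["now"],
    },
    "34": {
        "app.AlarmScheduler": ["schedule", "cancelAll"],
        "app.Clock": ["now", "nowMillis"],
    },
}

def build_kb(version: str) -> set[str]:
    """Flatten one version's documentation into fully qualified symbol names."""
    entries = DOCS[version]
    kb = set(entries)  # class names themselves are valid symbols
    for cls, members in entries.items():
        kb.update(f"{cls}.{member}" for member in members)
    return kb

kb34 = build_kb("34")
print("app.AlarmScheduler.cancelAll" in kb34)            # True for the target version
print("app.AlarmScheduler.cancelAll" in build_kb("33"))  # False: version-specific
```

Building one KB per target version is what lets the checker distinguish a phantom symbol from a symbol that merely does not exist yet at the version being migrated from; undocumented or deprecated symbols remain a coverage gap, as the rebuttal acknowledges.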
Circularity Check
No circularity: tool description and preliminary evaluation are self-contained with no derivations, fits, or self-referential reductions.
full rationale
The paper proposes a static-analysis tool (Hallucination Inspector) that extracts symbols via AST and checks them against a documentation-derived knowledge base, then reports a preliminary Android evaluation showing reduced false positives versus baselines. No equations, parameters, or predictions appear; the evaluation is an empirical demonstration rather than a closed-form result that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The central claim rests on the tool's design and the evaluation data rather than any definitional loop or fitted-input renaming. This is a standard tool-description paper whose validity hinges on external validation of the KB and evaluation set, not on internal circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: API documentation provides a complete and accurate enumeration of all valid symbols (imports, constructors, constants, etc.) for the target API.
invented entities (2)
- Phantom Symbols: no independent evidence
- Scaffolding Hallucination: no independent evidence