pith. machine review for the scientific record.

arxiv: 2604.20202 · v1 · submitted 2026-04-22 · 💻 cs.SE

Recognition: unknown

Hallucination Inspector: A Fact-Checking Judge for API Migration

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 00:47 UTC · model grok-4.3

classification 💻 cs.SE
keywords scaffolding hallucination · API migration · LLM code generation · static analysis · phantom symbols · hallucination detection · Android API · fact checking

The pith

Hallucination Inspector detects invented symbols in LLM-generated API migration code using static analysis against documentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often invent non-existent symbols when generating code to migrate between APIs, a problem the paper terms scaffolding hallucination. Standard evaluation metrics fail to catch these errors because they do not check against actual API specifications. The authors introduce Hallucination Inspector, which parses the generated code's abstract syntax tree to extract symbols and verifies them against a knowledge base built from the API's official documentation. A preliminary test on Android API migrations shows this method identifies the hallucinations and lowers false positive rates compared to common metrics and probabilistic judges. This matters because reliable detection is needed before LLMs can be trusted for automated code changes in real software projects.
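The failure of surface metrics described above can be illustrated with a minimal sketch (not the paper's evaluation code): two migration snippets that differ only in one invented identifier share almost all of their tokens, so a token-overlap score in the spirit of BLEU-style metrics stays high even though one snippet cannot compile. The `TRANSPORT_WLAN` constant below is a made-up stand-in for a phantom symbol, not an example from the paper.

```python
import re

def token_overlap(a: str, b: str) -> float:
    """Crude token-level similarity (a stand-in for BLEU-style metrics)."""
    ta, tb = re.findall(r"\w+", a), re.findall(r"\w+", b)
    shared = sum(min(ta.count(t), tb.count(t)) for t in set(ta))
    return 2 * shared / (len(ta) + len(tb))

correct = "manager.getNetworkCapabilities(network).hasTransport(TRANSPORT_WIFI)"
# TRANSPORT_WLAN does not exist: a hypothetical phantom symbol.
hallucinated = "manager.getNetworkCapabilities(network).hasTransport(TRANSPORT_WLAN)"

print(token_overlap(correct, hallucinated))  # high, despite the phantom symbol
```

Because the score never consults the API's actual symbol table, it cannot distinguish a plausible-looking hallucination from a valid call, which is the gap the proposed fact-checking judge targets.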

Core claim

Scaffolding hallucination occurs when LLMs generate incorrect calling contexts for new APIs by inventing phantom symbols such as imaginary imports, constructors, and constants that do not exist in the API specification. Standard metrics cannot be relied upon to detect these instances. Hallucination Inspector is a static analysis tool that extracts symbols from the abstract syntax tree of the generated code and checks them against a knowledge base derived directly from the software documentation. In a preliminary evaluation on Android API migrations, the approach successfully identifies hallucinations and significantly reduces false positives compared to standard metrics and probabilistic judges.

What carries the argument

Hallucination Inspector, a static analysis framework that extracts symbols from the abstract syntax tree of generated code and validates each against a knowledge base built directly from API documentation to flag phantom symbols.
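The extract-then-verify pipeline can be sketched in a few lines. The paper targets Java/Android code; this is a Python analogue using the standard `ast` module, with a toy knowledge base and an invented `requests.fetch_url` call standing in for a phantom symbol (none of these names come from the paper).

```python
import ast

# Toy knowledge base: the set of symbols the "documentation" actually defines.
KNOWN_SYMBOLS = {"requests", "requests.get", "requests.post"}

def extract_symbols(code: str) -> set[str]:
    """Collect imported module names and attribute accesses from the AST."""
    symbols = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            symbols.update(alias.name for alias in node.names)
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            symbols.add(f"{node.value.id}.{node.attr}")
    return symbols

def phantom_symbols(code: str) -> set[str]:
    """Symbols used by the generated code but absent from the knowledge base."""
    return extract_symbols(code) - KNOWN_SYMBOLS

snippet = "import requests\nrequests.fetch_url('https://example.com')"
print(phantom_symbols(snippet))  # {'requests.fetch_url'}
```

The design choice worth noting is that verification is a set-membership check: it is cheap and deterministic, but its precision is bounded by how faithfully the knowledge base mirrors the documentation.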

Load-bearing premise

The knowledge base built from software documentation is complete and accurate for all valid symbols, and the static AST extraction will always correctly identify every relevant symbol without missing errors or incorrectly flagging valid code.

What would settle it

A counterexample would be an LLM-generated migration snippet containing a phantom symbol that the inspector fails to flag as invalid, or a correct code snippet that the inspector incorrectly marks as containing a hallucination.

Figures

Figures reproduced from arXiv: 2604.20202 by Earl T. Barr, Marcos Tileria, Profir-Petru Pârțachi, Santanu Kumar Dash.

Figure 1: The Architecture of the Hallucination Inspector.
Figure 2: CodeBLEU score distribution for LLM migrations.
Original abstract

Large Language Models (LLMs) are increasingly deployed in automated software engineering for tasks such as API migration. While LLMs are able to identify migration patterns, they often make mistakes and fail to produce correct glue code to invoke the new API in place of the old one. We call this issue Scaffolding Hallucination, a failure mode where models generate incorrect calling contexts by inventing Phantom Symbols -- such as imaginary imports, constructors, and constants -- that do not exist in the API specification. In this paper, we show that standard metrics cannot be relied upon to detect these instances of hallucination. We propose Hallucination Inspector, a static analysis tool to detect Scaffolding Hallucination in LLM-generated code. Our approach includes a lightweight evaluation framework that verifies symbols extracted from the abstract syntax tree against a knowledge base derived directly from software documentation for the API. A preliminary evaluation on Android API migrations demonstrates that our approach successfully identifies hallucinations and significantly reduces false positives compared to standard metrics and probabilistic judges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper defines Scaffolding Hallucination as a failure mode in LLM-generated API migration code where phantom symbols (imaginary imports, constructors, constants) are invented that do not exist in the API specification. It proposes Hallucination Inspector, a static analysis tool that extracts symbols via AST and verifies them against a knowledge base built directly from software documentation. A preliminary evaluation on Android API migrations is claimed to show that the tool successfully identifies hallucinations and significantly reduces false positives relative to standard metrics and probabilistic judges.

Significance. If the evaluation results hold, the work provides a lightweight, documentation-grounded static checker that targets a concrete hallucination pattern in automated API migration. This could improve reliability of LLM-based code generation tools without requiring additional model training. The direct derivation of the knowledge base from documentation is a clear methodological strength that avoids circular reliance on the same LLM.

major comments (2)
  1. [Abstract / Preliminary Evaluation] The central claim in the abstract that the approach 'significantly reduces false positives' is not supported by any quantitative evidence. No false-positive rates, precision/recall figures, comparison tables, or baseline numbers are supplied, so the headline result cannot be assessed against the paper's own data.
  2. [Approach / Knowledge Base Construction] The validity of the false-positive reduction claim rests on the knowledge base being complete for all valid symbols that appear in correct migrations. No coverage audit, cross-check against the paper's own migration examples, or handling of version-specific or undocumented symbols is described; incomplete coverage would cause legitimate code to be flagged, directly undermining the reported improvement over baselines.
minor comments (2)
  1. [Abstract] The abstract introduces 'Phantom Symbols' and 'Scaffolding Hallucination' without a concise example or formal definition; adding one would improve immediate clarity for readers.
  2. [Preliminary Evaluation] The paper refers to 'standard metrics and probabilistic judges' without naming the specific baselines used in the Android evaluation; explicit citation or description of these comparators is needed for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our paper. We address the major comments point by point below, agreeing to make revisions where the evaluation claims require stronger support and additional methodological details.

Point-by-point responses
  1. Referee: [Abstract / Preliminary Evaluation] The central claim in the abstract that the approach 'significantly reduces false positives' is not supported by any quantitative evidence. No false-positive rates, precision/recall figures, comparison tables, or baseline numbers are supplied, so the headline result cannot be assessed against the paper's own data.

    Authors: We agree that the abstract's claim would benefit from quantitative backing to allow proper assessment. Although the preliminary evaluation section discusses comparisons to standard metrics and probabilistic judges with some illustrative examples, explicit numerical results such as false positive rates are not detailed in the current draft. In the revised manuscript, we will include a table with precision, recall, and false positive reduction metrics from our Android API migration tests, and update the abstract to reference these findings more precisely. revision: yes

  2. Referee: [Approach / Knowledge Base Construction] The validity of the false-positive reduction claim rests on the knowledge base being complete for all valid symbols that appear in correct migrations. No coverage audit, cross-check against the paper's own migration examples, or handling of version-specific or undocumented symbols is described; incomplete coverage would cause legitimate code to be flagged, directly undermining the reported improvement over baselines.

    Authors: The knowledge base is derived directly from the official Android documentation by extracting classes, methods, and constants. We did perform internal checks to ensure coverage for the symbols in our test migrations, but these were not documented in the paper. We will add a new subsection in the Approach section detailing the KB construction process, including a coverage analysis for the evaluated APIs, how version-specific symbols are handled by selecting the documentation for the target API version, and acknowledgment of potential gaps for undocumented or deprecated symbols. This will strengthen the justification for the false-positive reduction. revision: yes
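The knowledge-base construction the rebuttal describes can be sketched as follows. Real Android documentation is HTML/XML and the authors' extraction is not specified; the plain-text `DOC_STUB` and parsing below are purely illustrative assumptions about what "extracting classes, methods, and constants" might look like.

```python
import re

# Hypothetical documentation fragment; real Android docs are HTML/XML.
DOC_STUB = """
class ConnectivityManager
  method getNetworkCapabilities
  constant TYPE_WIFI
class NetworkCapabilities
  method hasTransport
  constant TRANSPORT_WIFI
"""

def build_knowledge_base(doc: str) -> set[str]:
    """Qualify each method/constant by its enclosing class."""
    kb, current_class = set(), None
    for kind, name in re.findall(r"(class|method|constant)\s+(\w+)", doc):
        if kind == "class":
            current_class = name
            kb.add(name)
        else:
            kb.add(f"{current_class}.{name}")
    return kb

kb = build_knowledge_base(DOC_STUB)
print("NetworkCapabilities.hasTransport" in kb)    # True: documented symbol
print("NetworkCapabilities.TRANSPORT_WLAN" in kb)  # False: would be flagged as phantom
```

The referee's coverage concern maps directly onto this sketch: any documented symbol the extractor misses becomes a false positive, which is why the promised coverage analysis matters.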

Circularity Check

0 steps flagged

No circularity: tool description and preliminary evaluation are self-contained with no derivations, fits, or self-referential reductions.

full rationale

The paper proposes a static-analysis tool (Hallucination Inspector) that extracts symbols via AST and checks them against a documentation-derived knowledge base, then reports a preliminary Android evaluation showing reduced false positives versus baselines. No equations, parameters, or predictions appear; the evaluation is an empirical demonstration rather than a closed-form result that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text. The central claim rests on the tool's design and the evaluation data rather than any definitional loop or fitted-input renaming. This is a standard tool-description paper whose validity hinges on external validation of the KB and evaluation set, not on internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the assumption that official API documentation supplies a complete, up-to-date set of valid symbols and that the static checker can map generated code symbols to this set without ambiguity or omission.

axioms (1)
  • domain assumption API documentation provides a complete and accurate enumeration of all valid symbols (imports, constructors, constants, etc.) for the target API.
    The verification step in the proposed framework relies on this to label symbols as phantom or real.
invented entities (2)
  • Phantom Symbols no independent evidence
    purpose: To name the invented non-existent symbols (imports, constructors, constants) that LLMs insert into migration code.
    New term introduced to characterize the hallucination failure mode.
  • Scaffolding Hallucination no independent evidence
    purpose: To label the specific failure mode in which LLMs generate incorrect calling contexts by inventing symbols absent from the API specification.
    New failure-mode category defined in the paper.

pith-pipeline@v0.9.0 · 5487 in / 1471 out tokens · 39676 ms · 2026-05-10T00:47:44.865289+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 9 canonical work pages · 1 internal anchor

  1. [1]

    Nishil Amin, Zhiwei Fei, Xiang Li, Justyna Petke, and He Ye. 2026. JMigBench: A Benchmark for Evaluating LLMs on Source Code Migration (Java 8 to Java 11). arXiv preprint arXiv:2602.09930 (2026).

  2. [2]

    Yujia Chen, Mingyu Chen, Cuiyun Gao, Zhihan Jiang, Zhongqi Li, and Yuchi Ma

  3. [3]

    Towards Mitigating API Hallucination in Code Generated by LLMs with Hierarchical Dependency Aware. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering (Clarion Hotel Trondheim, Trondheim, Norway) (FSE Companion ’25). Association for Computing Machinery, New York, NY, USA, 468–479.

  4. [4]

    Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Weikang Zhou, et al.

  5. [5]

    What’s Wrong with Your Code Generated by Large Language Models? An Extensive Study. Science China Information Sciences 69 (2026).

  6. [6]

    Mattia Fazzini, Qi Xin, and Alessandro Orso. 2020. Apimigrator: an api-usage migration tool for android apps. In Proceedings of the IEEE/ACM 7th International Conference on Mobile Software Engineering and Systems. 77–80.

  7. [7]

    Stefanus A Haryono, Ferdian Thung, David Lo, Lingxiao Jiang, Julia Lawall, Hong Jin Kang, Lucas Serrano, and Gilles Muller. 2022. AndroEvolve: Automated Android API update with data flow analysis and variable denormalization. Empirical Software Engineering 27, 3 (2022), 1–35.

  8. [8]

    Nihal Jain, Robert Kwiatkowski, Baishakhi Ray, Murali Krishna Ramanathan, and Varun Kumar. 2025. On Mitigating Code LLM Hallucinations with API Documentation. In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 237–248.

  9. [9]

    Haolin Jin and Huaming Chen. 2025. Uncovering Systematic Failures of LLMs in Verifying Code Against Natural Language Specifications. 3819–3823. doi:10.1109/ASE63991.2025.00323.

  10. [10]

    Charles Koutcheme, Nicola Dainese, Sami Sarsa, Arto Hellas, Juho Leinonen, Syed Ashraf, and Paul Denny. 2025. Evaluating language models for generating and judging programming feedback. In Proceedings of the 56th ACM Technical Symposium on Computer Science Education V. 1. 624–630.

  11. [11]

    Yuanpeng Li, Qi Long, Zhiyuan Yao, Jian Xu, Lintao Xie, Xu He, Lu Geng, Xin Han, Yueyan Chen, and Wenbo Duan. 2025. BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice. 2025 40th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2025), 3486–3497.

  12. [12]

    Chandra Maddila, Adam Tait, Claire Chang, Daniel Cheng, Nauman Ahmad, et al. 2025. Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback. arXiv preprint arXiv:2507.18755 (2025).

  13. [13]

    Tarek Mahmud, Bin Duan, Meiru Che, Awatif Yasmin, Anne HH Ngu, and Guowei Yang. 2024. Automated Update of Android Deprecated API Usages with Large Language Models. arXiv preprint arXiv:2411.04387 (2024).

  14. [14]

    Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, and Kyomin Jung. 2025. Don’t Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation. arXiv preprint arXiv:2505.16222 (2025).

  15. [15]

    Stoyan Nikolov, Daniele Codecasa, Anna Sjövall, Maxim Tabachnyk, Satish Chandra, Siddharth Taneja, and Celal Ziftci. 2025. How is Google Using AI for Internal Code Migrations? In 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 481–492.

  16. [16]

    Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2024. Lost in translation: A study of bugs introduced by large language models while translating code. In Proceedings of the IEEE/ACM 46th International Conference on Software Enginee...

  17. [17]

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. 2025. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. In Forty-second International Conference on Machine Learning.

  18. [18]

    Bibek Paudel, Alexander Lyzhov, Preetam Joshi, and Puneet Anand. 2025. HalluciNot: Hallucination detection through context and common knowledge verification. arXiv preprint arXiv:2504.07069 (2025).

  19. [19]

    Daniel Ramos, Hailie Mitchell, Inês Lynce, Vasco Manquinho, Ruben Martins, and Claire Le Goues. 2023. MELT: Mining Effective Lightweight Transformations from Pull Requests. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1516–1528.

  20. [20]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE].

  21. [21]

    Marcos Tileria, Jorge Blasco, and Santanu Kumar Dash. 2024. DocFlow: Extracting taint specifications from software documentation. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12.

  22. [22]

    Lukas Twist, Jie M Zhang, Mark Harman, and Helen Yannakoudakis. 2025. Library Hallucinations in LLMs: Risk Analysis Grounded in Developer Queries. arXiv preprint arXiv:2509.22202 (2025).

  23. [23]

    Chong Wang, Kaifeng Huang, Jian Zhang, Yebo Feng, Lyuye Zhang, Yang Liu, and Xin Peng. 2024. LLMs meet library evolution: Evaluating deprecated api usage in llm-based code completion. arXiv preprint arXiv:2406.09834 (2024).

  24. [24]

    Ruiqi Wang, Jiyu Guo, Cuiyun Gao, Guodong Fan, Chun Yong Chong, and Xin Xia. 2025. Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering. Proc. ACM Softw. Eng. 2, ISSTA, Article ISSTA086 (June 2025), 23 pages.

  25. [25]

    Moshi Wei, Nima Shiri Harzevili, YueKai Huang, Jinqiu Yang, Junjie Wang, and Song Wang. [n. d.]. Demystifying and Detecting Misuses of Deep Learning APIs. In 2024 IEEE/ACM 46th International Conference on Software Engineering (ICSE).

  26. [26]

    Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. 2025. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503.

  27. [27]

    Yanjie Zhao, Li Li, Kui Liu, and John C. Grundy. 2022. Towards Automatically Repairing Compatibility Issues in Published Android Apps. In 44th IEEE/ACM International Conference on Software Engineering (ICSE). 2142–2153.

  28. [28]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36 (2023), 46595–46623.

  29. [29]

    Terry Yue Zhuo, Junda He, Jiamou Sun, Zhenchang Xing, David Lo, John Grundy, and Xiaoning Du. 2026. Identifying and Mitigating API Misuse in Large Language Models. IEEE Transactions on Software Engineering (2026), 1–19.

  30. [30]

    Celal Ziftci, Stoyan Nikolov, Anna Sjövall, Bo Kim, Daniele Codecasa, and Max Kim. 2025. Migrating Code At Scale With LLMs At Google. Association for Computing Machinery, New York, NY, USA, 162–173.