Empirical Analysis and Detection of Hallucinations in LLM-Generated Bug Report Summaries

Abdallah Ayoub; Ahmad Abdel Latif; Gouri Ginde; Hinduja Nirujan; Shreyas Patil

arxiv: 2605.24137 · v1 · pith:PCSAMLJCnew · submitted 2026-05-22 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-30 14:58 UTC · model grok-4.3

keywords: hallucination detection · bug report summarization · LLM outputs · section-aware analysis · software maintenance · synthetic benchmark · joint prediction

The pith

A section-aware model jointly detects whether LLM bug summaries contain hallucinations, which sections are affected, and what type of error occurred.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates hallucinations in LLM-generated structured bug report summaries that include steps-to-reproduce, actual behavior, and expected behavior sections. It creates a benchmark by injecting controlled synthetic hallucinations into existing reports and trains models to perform three related tasks at once. A sympathetic reader would care because undetected fabrications or omissions can mislead developers who rely on automated summaries. The experiments show the best model reaching 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. The work also examines common error patterns to expose limits of current LLM outputs in software maintenance.
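For readers unfamiliar with the metric behind these numbers, Macro-F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one. The following is a minimal sketch in plain Python; the function name `macro_f1` and the label strings are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the truth but was missed
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical report-level labels, for illustration only.
truth = ["clean", "clean", "hallucinated", "hallucinated"]
preds = ["clean", "hallucinated", "hallucinated", "hallucinated"]
score = macro_f1(truth, preds)
```

Because every class is weighted equally, a detector that ignores a minority hallucination type is penalized more heavily under Macro-F1 than under plain accuracy.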

Core claim

The central claim is that a section-aware hallucination detection approach, which jointly predicts report-level presence of hallucinations, identifies the affected sections, and classifies hallucination types, delivers strong performance on a benchmark built from structured bug reports.

What carries the argument

The section-aware detection model that performs joint prediction over report-level hallucination, section identification, and hallucination-type classification.

If this is right

Detection performance improves when the model accounts for the distinct sections of a bug report rather than treating the entire summary as one block.
Common hallucination patterns identified in the analysis point to specific failure modes that future LLM summarizers can target.
The three-task joint prediction setup can be applied to other structured technical documents that require section-level reliability checks.
Higher detection scores at report, section, and type levels support more trustworthy use of LLMs inside software maintenance workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the synthetic benchmark approximates real cases, the same joint-prediction architecture could be retrained on other domains such as API documentation or test-case descriptions.
Failure-mode analysis in the paper suggests that section boundaries themselves may act as natural checkpoints for reducing hallucination risk during generation.
Extending the approach to streaming or incremental summaries would require only modest changes to the section-aware input encoding.

Load-bearing premise

The controlled synthetic hallucination injection produces a benchmark representative of the hallucinations that LLMs generate in real structured bug report summaries.
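To make the premise concrete, here is a minimal sketch of what controlled injection into a structured summary could look like. The abstract does not describe the actual procedure, so everything here is an assumption: the function `inject`, the sentence-level granularity, and the two kinds `omission` and `fabrication` (loosely mirroring the missing and fabricated categories of the exploratory study). The point is only that injection yields ground-truth labels at all three levels for free.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LabeledSummary:
    sections: dict                                       # section name -> list of sentences
    report_label: int = 0                                # 1 if any section was altered
    section_labels: dict = field(default_factory=dict)   # section name -> 0 or 1
    type_labels: dict = field(default_factory=dict)      # section name -> kind or "none"


def inject(summary, section, kind, rng, distractors=()):
    """Return a labeled copy of `summary` with one synthetic error in `section`.

    Hypothetical helper: `kind` is "omission" (drop a sentence) or
    "fabrication" (insert an unsupported sentence from `distractors`).
    """
    if section not in summary:
        raise KeyError(f"unknown section: {section}")
    sections = {name: list(sents) for name, sents in summary.items()}  # copy, never mutate input
    sents = sections[section]
    if kind == "omission":
        if not sents:
            raise ValueError("cannot omit from an empty section")
        sents.pop(rng.randrange(len(sents)))
    elif kind == "fabrication":
        if not distractors:
            raise ValueError("fabrication needs at least one distractor sentence")
        sents.insert(rng.randrange(len(sents) + 1), rng.choice(list(distractors)))
    else:
        raise ValueError(f"unknown kind: {kind}")
    return LabeledSummary(
        sections=sections,
        report_label=1,
        section_labels={name: int(name == section) for name in sections},
        type_labels={name: (kind if name == section else "none") for name in sections},
    )


# Hypothetical structured summary with the three sections named in the paper.
summary = {
    "S2R": ["Open the settings page.", "Click Sync."],
    "AB": ["The page freezes."],
    "EB": ["Sync completes."],
}
rng = random.Random(0)
omitted = inject(summary, "S2R", "omission", rng)
fabricated = inject(summary, "AB", "fabrication", rng,
                    distractors=["The browser crashes on startup."])
```

Whether errors produced this way resemble what an LLM actually emits is exactly the open question this premise raises.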

What would settle it

Running the trained detector on fresh LLM-generated summaries drawn directly from real bug reports that contain no synthetic injections would show whether the reported performance numbers hold outside the constructed benchmark.

Figures

Figures reproduced from arXiv: 2605.24137 by Abdallah Ayoub, Ahmad Abdel Latif, Gouri Ginde, Hinduja Nirujan, Shreyas Patil.

**Figure 3.** Our proposed section-aware hallucination detection approach and study design [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

**Figure 4.** Prompt template used to convert structured bug reports to unstructured... [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]

**Figure 5.** Average section-wise attention across transformer models. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png]

**Figure 6.** Quality analysis of generated bug report descriptions. (a) Distribu... [PITH_FULL_IMAGE:figures/full_fig_p008_6.png]

**Figure 7.** Frequencies of manual explanation labels from the qualitative error... [PITH_FULL_IMAGE:figures/full_fig_p010_7.png]

Original abstract

Large Language Models (LLMs) are increasingly used to generate summaries of software bug reports, including sections such as Steps-to-Reproduce (S2R), Actual Behavior (AB), and Expected Behavior (EB). However, these models frequently produce hallucinations that can be convincing but unsupported by the source report. This can mislead developers and reduce trust in automated maintenance tools. Existing hallucination detection approaches typically evaluate outputs at the full-response level and do not consider the structure of technical documents. An initial exploratory study on 80 structured bug report summaries found that approximately 47.9% contained missing information, while 12.3% included fabricated content, highlighting the need for systematic hallucination analysis in bug report summarization. In this work, we empirically investigate hallucinations in LLM-generated bug report summaries from a section-aware perspective. Using the BugsRepo dataset, derived from Mozilla OSS projects, we introduce controlled synthetic hallucination injection to construct a benchmark for training and evaluation. We propose a section-aware hallucination detection approach that jointly predicts whether a summary contains hallucinated content, identifies affected sections, and classifies hallucination types. Experimental results across multiple pretrained language models show that the proposed approach achieves strong performance across all tasks, with the best model obtaining 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1. We further analyze common hallucination patterns and model failure modes to better understand limitations of current LLM-generated bug report summaries. The findings highlight the importance of section-aware hallucination analysis for improving the reliability of LLM-assisted bug report summarization in software maintenance workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gets decent F1 on synthetic bug-report hallucinations with a section-aware multi-task setup, but the benchmark's realism is unproven.

The core contribution is a joint model that does report-level hallucination detection, section identification, and type classification for LLM-generated bug summaries. They build a benchmark by injecting synthetic hallucinations into the BugsRepo dataset and report 0.89/0.83/0.84 Macro-F1 across the three tasks with the best pretrained model.

What works is the practical framing. Bug reports have clear sections like S2R, AB, and EB, so treating hallucination detection as section-aware makes sense for the domain. The small exploratory study on 80 real summaries (showing 47.9% missing info) gives a quick sense that the problem is common enough to matter for SE tools.

The soft spot is the synthetic data. The numbers come entirely from controlled injections, yet the abstract gives no side-by-side comparison of hallucination frequencies, placement, or phrasing against actual LLM outputs on the same reports. If the injected examples differ in distribution, the F1 scores reflect an artificial task rather than deployment conditions. No baselines or significance tests are described either.

This is for people building or evaluating LLM summarizers for bug tracking. A reader already working on hallucination detection in technical text would find the task split useful to look at, but would still need to verify the injection method before relying on the numbers.

It is worth sending for peer review. The empirical setup is straightforward and the domain application is clear; the main revision needed is stronger evidence that the benchmark matches real LLM behavior.

Referee Report

2 major / 2 minor

Summary. The paper claims that hallucinations are common in LLM-generated structured bug report summaries (exploratory study on 80 examples: 47.9% missing information, 12.3% fabricated), constructs a benchmark by controlled synthetic hallucination injection into the BugsRepo dataset, and proposes a section-aware multi-task detector that jointly predicts report-level hallucination presence, affected sections, and hallucination types. It reports strong performance across pretrained language models, with the best achieving 0.89 report-level Macro-F1, 0.83 section-level Macro-F1, and 0.84 hallucination-type Macro-F1, plus analysis of patterns and failure modes.

Significance. If the synthetic benchmark is representative, the work would offer a practical, structure-aware method for detecting hallucinations in a high-stakes software engineering task, with the joint prediction formulation and explicit pattern analysis as clear strengths. The empirical focus on bug reports is timely for LLM-assisted maintenance tools.

major comments (2)

[Abstract / Methodology] Abstract and methodology section on benchmark construction: the headline Macro-F1 scores (0.89/0.83/0.84) are obtained exclusively on synthetically injected hallucinations. The exploratory study reports specific percentages for missing and fabricated content, yet no calibration, side-by-side frequency comparison, human realism ratings, or ablation on injection parameters is described to show that the synthetic distribution matches the section placement, linguistic features, or type frequencies of hallucinations actually emitted by the evaluated LLMs on real bug reports.
[Experimental results] Experimental results section: the abstract states specific F1 scores across multiple pretrained language models but provides no details on the models, baselines, injection methodology parameters, dataset splits, or statistical significance tests. Without these, it is impossible to assess whether the reported performance is robust or reproducible.
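The side-by-side frequency comparison requested in the first major comment could be as simple as measuring the distance between the hallucination-type distributions of the synthetic benchmark and of real LLM outputs. The sketch below uses total variation distance; the helper names and the label lists are hypothetical and carry no data from the paper.

```python
from collections import Counter


def distribution(labels):
    """Turn a list of categorical labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up annotations, for illustration only.
real_types = ["omission", "omission", "omission", "fabrication"]
synthetic_types = ["omission", "omission", "fabrication", "fabrication"]
gap = total_variation(distribution(real_types), distribution(synthetic_types))
```

The same comparison applies to section placement (S2R, AB, EB) by swapping the type labels for section labels. A large distance would suggest that the injection parameters need recalibrating before the headline scores can be read as deployment estimates.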

minor comments (2)

[Model description] The description of the three prediction tasks would benefit from an explicit equation or diagram showing how the joint loss is formulated.
[Related work] Related work could more explicitly contrast the proposed section-aware approach with prior full-response hallucination detectors.
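For orientation, one common way to write a joint objective over three such tasks is a weighted sum of per-task losses. The form below is an assumption offered to illustrate what the first minor comment asks for; it is not the paper's formulation, and the weights and loss choices are hypothetical.

```latex
\mathcal{L}
  = \lambda_{r}\,\mathcal{L}_{\text{report}}
  + \lambda_{s} \sum_{k \in \{\text{S2R},\,\text{AB},\,\text{EB}\}} \mathcal{L}_{\text{section}}^{(k)}
  + \lambda_{t}\,\mathcal{L}_{\text{type}}
```

Here $\mathcal{L}_{\text{report}}$ would be a binary cross-entropy on the report-level label, $\mathcal{L}_{\text{section}}^{(k)}$ a binary cross-entropy on whether section $k$ is affected, $\mathcal{L}_{\text{type}}$ a categorical cross-entropy over hallucination types, and $\lambda_{r}, \lambda_{s}, \lambda_{t} \ge 0$ task weights.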

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying our approach and committing to revisions where appropriate to improve clarity and reproducibility.

Referee: [Abstract / Methodology] Abstract and methodology section on benchmark construction: the headline Macro-F1 scores (0.89/0.83/0.84) are obtained exclusively on synthetically injected hallucinations. The exploratory study reports specific percentages for missing and fabricated content, yet no calibration, side-by-side frequency comparison, human realism ratings, or ablation on injection parameters is described to show that the synthetic distribution matches the section placement, linguistic features, or type frequencies of hallucinations actually emitted by the evaluated LLMs on real bug reports.

Authors: The exploratory study on 80 real LLM-generated summaries was intended only to show that hallucinations are prevalent (47.9% missing, 12.3% fabricated), while the synthetic injection on BugsRepo was chosen to enable controlled ground-truth labels at scale for training and evaluation. We did not include explicit calibration, human realism ratings, or frequency matching because the primary contribution is the section-aware joint detection method rather than a claim of distributional equivalence. We agree this is a limitation and will revise the methodology section to add: (1) explicit injection parameters and ablation results, (2) a qualitative comparison of synthetic vs. observed real hallucination patterns, and (3) a discussion of why full calibration was not performed. This will be marked as a limitation in the revised version.
Revision: yes

Referee: [Experimental results] Experimental results section: the abstract states specific F1 scores across multiple pretrained language models but provides no details on the models, baselines, injection methodology parameters, dataset splits, or statistical significance tests. Without these, it is impossible to assess whether the reported performance is robust or reproducible.

Authors: The full manuscript's experimental results section does specify the pretrained language models evaluated, the BugsRepo dataset splits, and the controlled injection procedure. However, we acknowledge that the abstract is too terse and that details on baselines, exact hyperparameters, and statistical significance testing are missing or insufficiently prominent. We will expand the experimental setup and results subsections to include: model names and sizes, baseline comparisons, injection parameter values, train/validation/test splits, and any significance tests performed. These additions will make the results fully reproducible.
Revision: yes
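As an illustration of the kind of significance test the referee asks about, a paired bootstrap over test examples is one standard option for comparing two detectors on the same data. This is a sketch under stated assumptions: the paper does not say which test, if any, was used, and `paired_bootstrap`, `accuracy`, and the toy labels are hypothetical. Any metric with the signature `metric(y_true, y_pred)`, such as a Macro-F1 function, can be passed in place of `accuracy`.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and truth agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def paired_bootstrap(metric, y_true, pred_a, pred_b, n_resamples=1000, seed=0):
    """Return (observed gap A - B, fraction of resamples where A does not beat B)."""
    rng = random.Random(seed)
    n = len(y_true)
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping A and B paired.
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if metric(t, a) - metric(t, b) <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Toy labels: system A is always right, system B is always wrong.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
system_a = list(y_true)
system_b = [1 - y for y in y_true]
gap, p_value = paired_bootstrap(accuracy, y_true, system_a, system_b)
```

A small returned fraction suggests the gap between the two systems is unlikely to be an artifact of which test examples happened to be sampled.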

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

The paper is an empirical classification study. It constructs a benchmark by injecting synthetic hallucinations into the BugsRepo dataset, trains section-aware detectors on pretrained language models, and reports standard Macro-F1 scores on held-out data. No mathematical derivations, equations, or predictions are present that could reduce to fitted inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The reported performance numbers (0.89/0.83/0.84) are direct experimental outcomes on the constructed test set and do not tautologically follow from the paper's own definitions or prior self-references. The validity concern about synthetic vs. real hallucinations is a question of benchmark representativeness, not circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine learning study with no mathematical derivations, free parameters, axioms, or invented entities; relies on standard supervised classification practices and a synthetically augmented dataset.

pith-pipeline@v0.9.1-grok · 5845 in / 1181 out tokens · 53154 ms · 2026-06-30T14:58:48.576902+00:00 · methodology


[1] [1]

InSemEval, 2024

Leveraging synthetic data for llm hallucination detection. InSemEval, 2024

2024

[2] [2]

Shamsu Abdullahi et al. The rise of hallucination in large lan- guage models: systematic reviews, performance analysis and chal- lenges.Cluster Computing, 29(2):124, Feb 2026.doi:10.1007/ s10586-025-05891-z

2026

[3] [3]

Can we enhance bug report quality using llms?: An empirical study of llm-based bug report generation

Jagrit Acharya and Gouri Ginde. Bugsrepo: A comprehensive curated dataset of bug reports, comments and contributors information from bugzilla. InProceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, EASE ’25, page 986–993, New York, NY , USA, 2025. Association for Computing Machinery.doi:10.1145/3756681.3756994

work page doi:10.1145/3756681.3756994 2025

[4] [4]

Can we enhance bug report quality using llms?: An empirical study of llm-based bug report generation

Jagrit Acharya and Gouri Ginde. Can we enhance bug report quality using llms?: An empirical study of llm-based bug report generation. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering, EASE ’25, page 994–1003, New York, NY , USA, 2025. Association for Computing Machinery.doi: 10.1145/3756681.3756995

work page doi:10.1145/3756681.3756995 2025

[5] [5]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960. arXiv:https://doi.org/10.1177/001316446002000104, doi:10.1177/001316446002000104

work page doi:10.1177/001316446002000104 1960

[6] [6]

Handling divergent reference texts when eval- uating table-to-text generation

Bhuwan Dhingra et al. Handling divergent reference texts when eval- uating table-to-text generation. In Anna Korhonen, David Traum, and Llu´ıs M`arquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4884–4895, Florence, Italy, July 2019. Association for Computational Linguistics. URL: https://aclant...

work page doi:10.18653/v1/p19-1483 2019

[7] [7]

Large language models for software engineering: Survey and open problems

Angela Fan et al. Large language models for software engineering: Survey and open problems. In2023 IEEE/ACM International Confer- ence on Software Engineering: Future of Software Engineering (ICSE- F oSE), pages 31–53, 2023.doi:10.1109/ICSE-FoSE59343. 2023.00008

work page doi:10.1109/icse-fose59343 2023

[8] [8]

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K

Sen Fang et al. Representthemall: A universal learning representation of bug reports. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 602–614, 2023.doi:10.1109/ ICSE48619.2023.00060

work page arXiv 2023

[9] [9]

A systematic literature review of code hallucinations in llms: Characterization, mitigation methods, challenges, and future directions for reliable ai

Cuiyun Gao, David Lo, et al. A systematic literature review of code hallucinations in llms: Characterization, mitigation methods, challenges, and future directions for reliable ai, 2025. URL: https://arxiv.org/abs/2511.00776, arXiv:2511.00776

work page arXiv 2025

[10] [10]

A survey on llm-as-a-judge

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. The Innovation, 2024

2024

[11] [11]

Analyzing and predicting effort associated with finding and fixing software faults

Maggie Hamill and Katerina Goseva-Popstojanova. Analyzing and predicting effort associated with finding and fixing software faults. Information and Software Technology, 87:1–18, 2017. URL: https://www.sciencedirect.com/science/article/pii/S0950584917300290, doi:10.1016/j.infsof.2017.01.002

work page doi:10.1016/j.infsof.2017.01.002 2017

[12] [12]

Hallucination detection on code generation with selfcheckgpt

Waka Ito, Yui Obara, Miyu Sato, and Kimio Kuramitsu. Hallucination detection on code generation with selfcheckgpt. Journal of Information Processing, 33:487–493, 2025. doi:10.2197/ipsjjip.33.487

work page doi:10.2197/ipsjjip.33.487 2025

[13] [13]

Survey of Hallucination in Natural Language Generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12), March 2023. doi:10.1145/3571730

work page doi:10.1145/3571730 2023

[14] [14]

He Jiang, Najam Nazar, et al. Prst: A pagerank-based summarization technique for summarizing bug reports with duplicates. International Journal of Software Engineering and Knowledge Engineering, 27(06):869–896, 2017. URL: https://doi.org/10.1142/S0218194017500322, doi:10.1142/S0218194017500322

work page doi:10.1142/s0218194017500322 2017

[15] [15]

A systematic review of software development cost estimation studies

Magne Jorgensen and Martin Shepperd. A systematic review of software development cost estimation studies. IEEE Transactions on Software Engineering, 33(1):33–53, 2007. doi:10.1109/TSE.2007.256943

work page doi:10.1109/tse.2007.256943 2007

[16] [16]

Lie to me: Knowledge graphs for robust hallucination self-detection in llms

Sahil Kale and Antonio Luca Alfeo. Lie to me: Knowledge graphs for robust hallucination self-detection in llms, 2025. URL: https://arxiv.org/abs/2512.23547, arXiv:2512.23547

work page arXiv 2025

[17] [17]

The cost of poor software quality in the us: A 2022 report

Herb Krasner. The cost of poor software quality in the us: A 2022 report. Technical report, Consortium for Information & Software Quality, 2022. URL: https://www.it-cisq.org/the-cost-of-poor-quality-software-in-the-us-a-2022-report/

2022

[18] [18]

Importing phantoms: Measuring llm package hallucination vulnerabilities

Arjun Krishna, Erick Galinkin, Leon Derczynski, and Jeffrey Martin. Importing phantoms: Measuring llm package hallucination vulnerabilities, January 2025. doi:10.48550/arXiv.2501.19012

work page doi:10.48550/arxiv.2501.19012 2025

[19] [19]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL: https://aclanthology.org/W04-1013/

2004

[20] [20]

Hallucinations in code change to natural language generation: Prevalence and evaluation of detection metrics

Chunhua Liu et al. Hallucinations in code change to natural language generation: Prevalence and evaluation of detection metrics. In Proceedings of the 14th International Joint Conference on Natural Language Processing, pages 2538–2560, Mumbai, India, December 2025. The Asian Federation of Natural Language Processing and The Association for Computational ...

work page doi:10.18653/v1/2025.ijcnlp-long.137 2025

[21] [21]

Beyond functional correctness: Exploring hallucinations in llm-generated code

Fang Liu, Yang Liu, Lin Shi, Zhen Yang, Li Zhang, Xiaoli Lian, Zhongqi Li, and Yuchi Ma. Beyond functional correctness: Exploring hallucinations in llm-generated code. IEEE Transactions on Software Engineering, 52:1037–1055, 2024. doi:10.1109/tse.2026.3657432

work page doi:10.1109/tse.2026.3657432 2024

[22] [22]

Modelling the ‘hurried’ bug report reading process to summarize bug reports

Rafael Lotufo, Zeeshan Malik, and Krzysztof Czarnecki. Modelling the ‘hurried’ bug report reading process to summarize bug reports. Empirical Software Engineering, 20(2):516–548, Apr 2015. doi:10.1007/s10664-014-9311-2

work page doi:10.1007/s10664-014-9311-2 2015

[23] [23]

ETF: An entity tracing framework for hallucination detection in code summaries

Kishan Maharaj, Pushpak Bhattacharyya, et al. ETF: An entity tracing framework for hallucination detection in code summaries. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30639–30652, Vienna,...

work page doi:10.18653/v1/2025.acl-long.1480 2025

[24] [24]

Ausum: approach for unsupervised bug report summarization

Senthil Mani, Rose Catherine, Vibha Singhal Sinha, and Avinava Dubey. Ausum: approach for unsupervised bug report summarization. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, FSE ’12, New York, NY, USA, 2012. Association for Computing Machinery. doi:10.1145/2393596.2393607

work page arXiv 2012

[25] [25]

On faithfulness and factuality in abstractive summarization

Joshua Maynez et al. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Linguistics. URL: https://aclanthology.o...

2020

[26] [26]

Selfcheck-eval: A multi-module framework for zero-resource hallucination detection in large language models

Diyana Muhammed et al. Selfcheck-eval: A multi-module framework for zero-resource hallucination detection in large language models. 2025. URL: https://api.semanticscholar.org/CorpusID:276107224

2025

[27] [28]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, USA, 2002. Association for Computational Linguistics. doi:10.3115/1073083.1073135

work page doi:10.3115/1073083.1073135 2002

[28] [29]

Hallucinot: Hallucination detection through context and common knowledge verification

Bibek Paudel, Alexander Lyzhov, Preetam Joshi, and Puneet Anand. Hallucinot: Hallucination detection through context and common knowledge verification. ArXiv, abs/2504.07069, 2025. URL: https://api.semanticscholar.org/CorpusID:277634150

work page arXiv 2025

[29] [30]

Enhancing bug report summaries through knowledge-specific and contrastive learning pre-training

Yunna Shao and Bangmeng Xiang. Enhancing bug report summaries through knowledge-specific and contrastive learning pre-training. IEEE Access, 12:37653–37662, 2024. doi:10.1109/ACCESS.2024.3368915

work page doi:10.1109/access.2024 2024

[30] [31]

ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?

Salma Begum Tamanna, Gias Uddin, et al. ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?, pages 2290–2302. IEEE Press, 2025. URL: https://doi.org/10.1109/ICSE55347.2025.00145

work page arXiv 2025

[31] [32]

HalluJudge: A Reference-Free Hallucination Detection for Context Misalignment in Code Review Automation

Chakkrit Kla Tantithamthavorn et al. Hallujudge: A reference-free hallucination detection for context misalignment in code review automation. ArXiv, abs/2601.19072, 2026. URL: https://api.semanticscholar.org/CorpusID:285070919

work page internal anchor Pith review Pith/arXiv arXiv 2026

[32] [33]

Marcos Tileria, Santanu Kumar Dash, Profir-Petru Pârțachi, and Earl T. Barr. Hallucination inspector: A fact-checking judge for api migration. URL: https://arxiv.org/abs/2604.20202, arXiv:2604.20202

work page internal anchor Pith review Pith/arXiv arXiv

[34] [35]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL: https://arxiv.org/abs/1706.03762, arXiv:1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [36]

Evaluating the usefulness of ir-based fault localization techniques

Qianqian Wang, Chris Parnin, and Alessandro Orso. Evaluating the usefulness of ir-based fault localization techniques. In Proceedings of the 2015 international symposium on software testing and analysis, pages 1–11, 2015

2015

[36] [37]

Sumllama: Efficient contrastive representations and fine-tuned adapters for bug report summarization

Bangmeng Xiang and Yunna Shao. Sumllama: Efficient contrastive representations and fine-tuned adapters for bug report summarization. IEEE Access, 12:78562–78571, 2024. doi:10.1109/ACCESS.2024.3397326

work page doi:10.1109/access 2024

[37] [38]

Brmds: an llm-based multi-dimensional summary generation approach for bug reports

Yayun Zhang, Yuying Li, Minying Fang, Xing Yuan, and Junwei Du. Brmds: an llm-based multi-dimensional summary generation approach for bug reports.Automated Software Engg., 33(1), September 2025. doi:10.1007/s10515-025-00553-1

work page doi:10.1007/s10515-025-00553-1 2025

[38] [39]

Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation

Ziyao Zhang, Chong Wang, Yanlin Wang, Ensheng Shi, Yuchi Ma, Wanjun Zhong, Jiachi Chen, Mingzhi Mao, and Zibin Zheng. Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering, 2:481–503, 2024. doi:10.1145/3728894

work page doi:10.1145/3728894 2024