arxiv: 2604.19224 · v1 · submitted 2026-04-21 · 💻 cs.SE

iCoRe: An Iterative Correlation-Aware Retriever for Bug Reproduction Test Generation

Junyi Wang, Jialun Cao, Zhongxin Liu

Pith reviewed 2026-05-10 02:29 UTC · model grok-4.3

classification 💻 cs.SE

keywords: bug reproduction tests, LLM context retrieval, iterative retrieval, correlation-aware retrieval, software testing, code generation, test generation, function call structure


The pith

An iterative retriever that tracks source-test differences, semantic-structural links, and generation feedback raises LLM success at producing bug reproduction tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that retrieval methods used to give context to large language models for turning bug reports into executable tests are limited by three blind spots: they fetch source code and tests with the same rules, they score relevance only by textual meaning while missing call-graph ties, and they never update what they fetch after seeing whether the generated test actually runs. iCoRe fixes these by running an explicit loop that differentiates the two kinds of files, scores both meaning and structure together, and feeds execution outcomes back to refine the next round of retrieval. If the claim holds, developers would receive higher-quality context automatically, leading to more tests that correctly reproduce reported bugs without manual file hunting. The evaluation on two standard benchmarks shows concrete gains over prior retrieval baselines.

Core claim

iCoRe is an iterative, correlation-aware retriever that distinguishes the retrieval needs of source code from those of test cases, evaluates relevance by combining textual semantics with function-call structure, and closes a feedback loop between the retrieval stage and the subsequent test-generation stage so that execution results can refine the supplied context. When paired with an LLM generator, the approach produces Fail-to-Pass rates of 42.0 percent on SWT-bench Lite and 52.8 percent on TDD-bench Verified, for relative gains of 19.7 to 31.7 percent over existing retrieval techniques.
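The headline figures mix absolute Fail-to-Pass rates with relative gains, so a small arithmetic sketch may help keep the two apart. The helper name `relative_gain` and the baseline value of 35.0 are assumptions made for illustration only; the paper's actual baseline rates are not given in this summary.

```python
def relative_gain(new_rate: float, baseline_rate: float) -> float:
    """Return the relative improvement of new_rate over baseline_rate as a fraction."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (new_rate - baseline_rate) / baseline_rate


# 42.0 is the reported SWT-bench Lite Fail-to-Pass rate.
# 35.0 is a HYPOTHETICAL baseline picked for easy arithmetic, not a paper result.
gain = relative_gain(42.0, 35.0)
print(f"relative gain over the hypothetical baseline: {gain:.1%}")
```

A relative gain is always measured against the baseline rate, so the same absolute difference in percentage points yields a larger relative gain when the baseline is lower.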

What carries the argument

iCoRe, the iterative loop that explicitly models three correlations (source-test differentiation, semantic-structural relevance, and retrieval-generation feedback) to supply refined context to the LLM.
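The retrieval-generation feedback loop described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `iterative_brt`, the three callback signatures, the `max_rounds` default, and the idea that the executor returns a boolean "reproduces the bug" verdict are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical callback shapes, chosen only for this sketch.
Retriever = Callable[[str, str], List[str]]   # (issue, feedback) -> context snippets
Generator = Callable[[str, List[str]], str]   # (issue, context) -> candidate test code
Executor = Callable[[str], Tuple[bool, str]]  # test code -> (reproduces_bug, execution log)


def iterative_brt(
    issue: str,
    retrieve: Retriever,
    generate: Generator,
    execute: Executor,
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run retrieve -> generate -> execute, feeding each execution log into the next retrieval."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback = ""   # the first round has no execution feedback yet
    test_code = ""
    for round_idx in range(1, max_rounds + 1):
        context = retrieve(issue, feedback)
        test_code = generate(issue, context)
        reproduces, log = execute(test_code)
        if reproduces:
            return test_code, round_idx
        feedback = log  # e.g. an ImportError message that points at a missing dependency
    # Round budget exhausted: return the last candidate as-is.
    return test_code, max_rounds
```

Keeping the three stages behind plain callables reflects the claim that the loop can be attached to an existing pipeline: only the retriever needs to accept the extra feedback argument.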

If this is right

Higher-quality context directly increases the fraction of generated tests that reproduce the reported bug.
The same retrieval loop can be attached to any existing LLM-based bug reproduction pipeline.
Function-call structure becomes a usable signal alongside text similarity when ranking candidate files.
Execution feedback from failed tests can be reused to steer later retrieval rounds toward missing dependencies.
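One way to picture function-call structure as a ranking signal alongside text similarity is the sketch below. Everything in it is an assumption for illustration: token-level Jaccard overlap stands in for whatever semantic scorer the paper uses, call-graph hop distance stands in for its structural measure, and the names `token_jaccard`, `call_distance`, `combined_score`, and the mixing weight `alpha` are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Set


def token_jaccard(a: str, b: str) -> float:
    """Stand-in semantic score: Jaccard overlap of lowercase whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def call_distance(calls: Dict[str, List[str]], seeds: Iterable[str], target: str) -> Optional[int]:
    """Fewest call edges (either direction) from any seed function to target, or None."""
    adj: Dict[str, Set[str]] = {}
    for caller, callees in calls.items():
        for callee in callees:
            adj.setdefault(caller, set()).add(callee)
            adj.setdefault(callee, set()).add(caller)
    seen = set(seeds)
    queue = deque((s, 0) for s in seen)
    while queue:  # breadth-first search, so the first hit is the shortest path
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def combined_score(query: str, text: str, calls: Dict[str, List[str]],
                   seeds: Iterable[str], func: str, alpha: float = 0.5) -> float:
    """Blend the semantic and structural scores; alpha is a hypothetical mixing weight."""
    dist = call_distance(calls, seeds, func)
    structural = 0.0 if dist is None else 1.0 / (1 + dist)
    return alpha * token_jaccard(query, text) + (1 - alpha) * structural
```

With this shape, a function that shares few words with the issue text can still rank well if it sits one call away from a function the issue names directly.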

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same three-correlation pattern could be applied to other LLM code tasks such as automated repair or test augmentation.
Extending the loop to incorporate richer execution traces might further reduce irrelevant context.
In large repositories the method could cut the time developers spend manually selecting files for debugging.
If the feedback loop scales, it may reduce the need for very large context windows in the generator model.

Load-bearing premise

That the three named correlations are the dominant factors limiting current retrievers, and that an iterative loop will address them without introducing new noise or instability in the LLM outputs.

What would settle it

On the same two benchmarks, disable the iterative feedback component of iCoRe and check whether Fail-to-Pass rates fall back to the levels of the non-iterative baselines.
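The check described above is a leave-one-out ablation. A minimal sketch of how such variants might be enumerated is shown below; `RetrieverConfig`, its three flag names, and `leave_one_out` are hypothetical and do not correspond to any configuration published with the paper.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict


@dataclass(frozen=True)
class RetrieverConfig:
    """Hypothetical switches, one per correlation named in the paper."""
    differentiate_source_test: bool = True
    use_call_structure: bool = True
    use_generation_feedback: bool = True


def leave_one_out(full: RetrieverConfig) -> Dict[str, RetrieverConfig]:
    """Return one variant per flag, with only that flag turned off."""
    return {f.name: replace(full, **{f.name: False}) for f in fields(full)}


for name, variant in leave_one_out(RetrieverConfig()).items():
    print(name, variant)
```

Running each variant on the same benchmark instances, with everything else held fixed, is what would let a drop in Fail-to-Pass rate be attributed to the disabled component.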

Figures

Figures reproduced from arXiv: 2604.19224 by Jialun Cao, Junyi Wang, Zhongxin Liu.

**Figure 1.** Motivating Example. To address the above limitations, we propose a novel iterative Correlation-aware Retrieval approach, iCoRe, for automated BRT generation. (Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE186. Publication date: July 2026.)

**Figure 2.** Motivating Example (continued).

**Figure 3.** Overview of iCoRe. [...] 3) the powerful feedback correlation between the generation and retrieval phases, which enables iterative refinement. By being aware of these correlations, iCoRe systematically alleviates the failures of prior approaches.

**Figure 4.** Impact of Iteration Rounds. We conducted a sensitivity analysis on three key hyperparameters using the SWT-bench Lite dataset: the number of retrieval iterations, the number of relevant tests provided as context, and the base weight used in function similarity calculation. We define an iteration as a cycle of generating a sketch BRT and using its feedback to refine the retriev…

**Figure 5.** Examples of Failure Cases. The majority of failures occurred when the retrieved context was sufficient, but the generator failed to output a correct BRT. These failures stem from the limited reasoning capabilities of the model or the simple one-pass generation process. For example, for sympy__sympy-23413 (Figure 5A), the generated test failed to reproduce the issue simply due to a missing numpy import. Wit…

Original abstract

Automatically generating bug reproduction tests (BRT) from issue descriptions is crucial for software maintenance. LLM-based approaches have shown great potential for this task. Their effectiveness heavily relies on retrieving high-quality context from the codebase. The retrieval phase of existing approaches relies on either traditional methods like BM25 or LLM-driven strategies. LLM-based retrieval strategies typically equip an LLM with tools to autonomously explore the repository or select the most relevant files and code snippets from a provided list as context. However, these retrieval methods suffer from three key limitations: 1) They often employ a unified strategy for retrieving both source code and test cases, overlooking their distinct retrieval requirements. 2) They focus solely on semantic similarity while ignoring function call relationships, leading to irrelevant context. 3) The retrieval lacks a feedback loop from the generation phase, preventing it from refining the context based on execution results. These limitations collectively result in low-quality context, thereby hindering the accuracy of bug reproduction. To address these challenges, we propose iCoRe, an iterative, correlation-aware context retrieval approach explicitly aware of three key correlations: 1) between source code and test cases, which requires differentiated retrieval, 2) between textual semantics and function call structures for accurate relevance assessment, and 3) between the retrieval and generation phases, which enables iterative feedback and refinement. To evaluate iCoRe, we integrate it with an LLM-based BRT generator and conduct a comprehensive evaluation on the SWT-bench Lite and TDD-bench Verified benchmarks. Experimental results show that our method achieves a Fail-to-Pass rate of 42.0% and 52.8% respectively, representing 19.7%-31.7% relative improvements over existing retrieval methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

iCoRe adds differentiated source-test retrieval, call-graph awareness, and an execution feedback loop to LLM bug test generation, with concrete benchmark gains that still lack the ablations needed to tie them to those mechanisms.


iCoRe takes existing LLM retrieval for bug reproduction tests and layers in three targeted correlations plus iteration. It separates source-code retrieval from test retrieval, scores function-call structure alongside semantics rather than semantics alone, and feeds execution results back to refine the context. On the two public benchmarks the abstract reports Fail-to-Pass rates of 42.0% and 52.8%, a 19.7%-31.7% relative lift over the baselines it cites. That is the core takeaway: a practical engineering change that produces measurable gains on standard suites.

Referee Report

3 major / 1 minor

Summary. The paper proposes iCoRe, an iterative correlation-aware retriever for LLM-based bug reproduction test (BRT) generation from issue descriptions. It identifies three limitations in prior retrieval approaches (unified source/test strategies, semantic-only similarity ignoring function-call structure, and absence of generation-phase feedback) and claims to address them by explicitly modeling source-test differentiation, semantic-structural relevance, and retrieval-generation feedback loops. On SWT-bench Lite and TDD-bench Verified, iCoRe integrated with an LLM generator achieves Fail-to-Pass rates of 42.0% and 52.8%, corresponding to 19.7%-31.7% relative gains over existing retrieval baselines.

Significance. If the gains can be shown to arise specifically from the three modeled correlations and the iterative refinement rather than from increased context volume or prompt variations, the work would provide a practical advance in retrieval-augmented generation for software maintenance tasks. The use of public benchmarks and concrete F2P metrics is a positive step; however, the current evidence does not yet allow confident attribution of the improvements to the claimed mechanisms.

major comments (3)

[Experimental evaluation] Experimental evaluation: The reported 19.7%-31.7% relative F2P improvements are presented without ablation studies that remove individual correlations (source-test differentiation, semantic-structural relevance, retrieval-generation feedback) or the iterative loop one at a time. This omission makes it impossible to verify that the gains are driven by the proposed mechanisms rather than by simply supplying more context or altering prompt structure.
[Results] Results reporting: No variance across runs, statistical significance tests, or per-iteration statistics (e.g., number of iterations until convergence, failure modes of the feedback loop) are provided for the 42.0% and 52.8% F2P figures. Without these, the stability of the iterative process and the reliability of the headline numbers cannot be assessed.
[Experimental evaluation] Baseline comparison: The paper does not state whether the existing retrieval baselines were given equivalent total context length or subjected to the same prompt-engineering effort as iCoRe; this leaves open the possibility that part of the observed delta is attributable to differences in input volume rather than the correlation-aware design.
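The matched-context control raised in the last comment can be pictured as a single budget function applied to every retriever's ranked output. The sketch below is an assumption-laden illustration: `fit_to_budget` is a hypothetical helper, and whitespace splitting is only a stand-in for the tokenizer of whichever model is actually used.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> int:
    """Stand-in token counter; a real setup would use the generator model's tokenizer."""
    return len(text.split())


def fit_to_budget(
    ranked_snippets: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = whitespace_tokens,
) -> List[str]:
    """Keep the longest prefix of the ranked snippets whose total token count fits the budget."""
    kept: List[str] = []
    used = 0
    for snippet in ranked_snippets:
        cost = count_tokens(snippet)
        if used + cost > budget:
            break  # stop at the first overflow so the rank order is never reshuffled
        kept.append(snippet)
        used += cost
    return kept
```

Applying the same function and the same budget to iCoRe and to each baseline would remove input volume as a competing explanation for any difference in Fail-to-Pass rate.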

minor comments (1)

[Abstract and Evaluation] The abstract and evaluation section should explicitly name the exact retrieval baselines (e.g., BM25, specific LLM-driven methods) and the precise F2P definition used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our experimental evaluation. We agree that stronger evidence is needed to attribute the observed gains specifically to the proposed correlation-aware mechanisms and iterative refinement. We address each major comment below and will revise the manuscript to incorporate additional analyses and clarifications.


Referee: [Experimental evaluation] Experimental evaluation: The reported 19.7%-31.7% relative F2P improvements are presented without ablation studies that remove individual correlations (source-test differentiation, semantic-structural relevance, retrieval-generation feedback) or the iterative loop one at a time. This omission makes it impossible to verify that the gains are driven by the proposed mechanisms rather than by simply supplying more context or altering prompt structure.

Authors: We acknowledge that the current manuscript does not include explicit ablation studies isolating each of the three correlations and the iterative loop. To strengthen attribution, we will add a dedicated ablation section in the revised paper. This will include variants that disable source-test differentiation, semantic-structural relevance, retrieval-generation feedback, and the iterative process one at a time, while keeping total context length fixed. We will report the resulting F2P rates on both SWT-bench Lite and TDD-bench Verified and discuss how the drops confirm the contribution of each component beyond mere context volume. revision: yes
Referee: [Results] Results reporting: No variance across runs, statistical significance tests, or per-iteration statistics (e.g., number of iterations until convergence, failure modes of the feedback loop) are provided for the 42.0% and 52.8% F2P figures. Without these, the stability of the iterative process and the reliability of the headline numbers cannot be assessed.

Authors: We agree that variance, significance testing, and iteration-level diagnostics are necessary for assessing reliability. In the revision we will re-run the full iCoRe pipeline (and baselines) across at least five independent trials, reporting mean F2P rates with standard deviations. We will also include paired statistical significance tests against each baseline and add a new subsection with per-iteration statistics: average iterations to convergence, distribution of iteration counts, and qualitative analysis of feedback-loop failure cases. revision: yes
Referee: [Experimental evaluation] Baseline comparison: The paper does not state whether the existing retrieval baselines were given equivalent total context length or subjected to the same prompt-engineering effort as iCoRe; this leaves open the possibility that part of the observed delta is attributable to differences in input volume rather than the correlation-aware design.

Authors: We will revise the experimental setup and implementation details sections to explicitly document the total context-token budget and prompt templates applied to every baseline. Where the original runs did not enforce strict equivalence, we will re-evaluate the baselines under identical token limits and prompt-engineering effort, then report the updated F2P numbers alongside the original figures to demonstrate that the gains persist under matched conditions. revision: yes
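The variance and paired-significance reporting promised in the second response could take a form like the sketch below. The choice of an exact McNemar test is an assumption (the rebuttal says only "paired statistical significance tests"), and the helper names `summarize_runs` and `mcnemar_exact` are hypothetical.

```python
from math import comb
from statistics import mean, stdev
from typing import List, Tuple


def summarize_runs(f2p_rates: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of Fail-to-Pass rates across independent runs."""
    if len(f2p_rates) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(f2p_rates), stdev(f2p_rates)


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value.

    only_a: instances solved by method A but not by method B.
    only_b: instances solved by method B but not by method A.
    Instances where both methods agree carry no information and are ignored.
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) probability of a split at least as lopsided as the one observed.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A paired test suits this setting because both methods are evaluated on the same benchmark instances, so only the instances where their outcomes differ bear on the comparison.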

Circularity Check

0 steps flagged

No circularity; empirical claims rest on benchmark comparisons


The paper identifies three limitations in existing retrievers for bug reproduction test generation and proposes iCoRe to address them via explicit modeling of source-test differentiation, semantic-structural relevance, and retrieval-generation feedback. It then reports Fail-to-Pass rates on SWT-bench Lite and TDD-bench Verified that exceed prior methods by 19.7%-31.7% in relative terms. No equations, fitted parameters, or self-citations appear in the provided text; the performance numbers are measured externally against public benchmarks and do not reduce to the method's own definitions or inputs by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on the empirical claim that the three correlations are the main bottlenecks and that LLM-based generation plus execution feedback can exploit them; no explicit free parameters, axioms, or invented entities are stated in the abstract.

