arxiv: 2408.15815 · v2 · submitted 2024-08-28 · 💻 cs.SE

MR-Adopt: Automatic Deduction of Input Transformation Function for Metamorphic Testing

Pith reviewed 2026-05-23 22:15 UTC · model grok-4.3

classification 💻 cs.SE
keywords metamorphic testing · input transformation deduction · LLM code generation · test adequacy · metamorphic relations · software testing automation · data flow analysis

The pith

MR-Adopt deduces input transformations from hard-coded metamorphic test cases to allow reuse with new inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MR-Adopt to extract reusable input transformation functions from test cases that encode metamorphic relations but hard-code the inputs. It employs large language models to generate additional source and follow-up input pairs from the single available example, uses those pairs to guide the generation of candidate transformations, and then refines the candidates with data-flow analysis to remove code irrelevant to the relation. The best transformation is chosen by checking how well it satisfies the encoded output relations. This approach produces a transformation applicable to all experimental source inputs for 72.00 percent of the relations tested, 33.33 percent more than vanilla GPT-3.5, and when the transformations are used it raises line coverage by 10.62 percent and mutation score by 18.91 percent.

Core claim

MR-Adopt automatically deduces the input transformation from the hard-coded source and follow-up inputs in encoded MR test cases. With typically only one pair available, LLMs generate additional source-followup pairs to guide generalizable transformations, which are refined by removing irrelevant code via data-flow analysis and selected based on the output relations.
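To make the distinction concrete, the following is a minimal sketch of a hard-coded MR test next to the same relation driven by an explicit input transformation. Everything in it is hypothetical and chosen for illustration: the function under test `total`, the literal inputs, and the transformation `reverse_items` do not come from the paper.

```python
def total(prices):
    """Hypothetical function under test: sums a list of integer prices."""
    return sum(prices)


def test_hard_coded_mr():
    # Both inputs are literals, so the relation cannot be reused with new inputs.
    source_input = [3, 1, 2]
    followup_input = [2, 1, 3]
    assert total(source_input) == total(followup_input)


def reverse_items(source_input):
    """Hypothetical deduced input transformation: reverse the list."""
    return list(reversed(source_input))


def check_mr(source_input):
    # Same output relation, but the follow-up input is now computed.
    followup_input = reverse_items(source_input)
    return total(source_input) == total(followup_input)


test_hard_coded_mr()
results = [check_mr(xs) for xs in ([3, 1, 2], [], [7], [5, 5, 9, 0])]
print(results)
```

Once the transformation is explicit, any new source input can exercise the relation, which is the reuse the core claim is about.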

What carries the argument

LLM generation of additional input pairs combined with data-flow refinement and output-relation selection to deduce general input transformations.
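The selection step can be pictured with a small sketch. It assumes a scoring rule (count the source inputs on which the encoded output relation holds) that is a simplification for illustration, not the paper's documented procedure; `select_transformation`, the candidate list, and the use of `abs` as the program under test are all hypothetical.

```python
def select_transformation(candidates, source_inputs, program, relation):
    """Return the (name, fn) candidate whose follow-up inputs satisfy the
    output relation on the most source inputs; ties go to the earliest."""
    def score(fn):
        passed = 0
        for x in source_inputs:
            try:
                if relation(program(x), program(fn(x))):
                    passed += 1
            except Exception:
                # A candidate that crashes on an input earns no credit for it.
                pass
        return passed

    return max(candidates, key=lambda item: score(item[1]))


# Hypothetical setup: the program is abs and the relation is "equal outputs".
candidates = [
    ("add_one", lambda x: x + 1),
    ("negate", lambda x: -x),
    ("double", lambda x: 2 * x),
]
best_name, _ = select_transformation(
    candidates,
    source_inputs=[-3, 0, 4, 10],
    program=abs,
    relation=lambda out_src, out_fol: out_src == out_fol,
)
print(best_name)  # negate
```

Only `negate` preserves the relation on every input here, so it wins; `double` passes only on the input 0.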

If this is right

  • Input transformations work for all experimental source inputs in 72.00% of encoded MRs.
  • This rate is 33.33% higher than with vanilla GPT-3.5.
  • Encoded MR-based test cases increase line coverage by 10.62%.
  • Mutation scores rise by 18.91% when using the generated transformations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could apply this to existing test suites to expand coverage without rewriting relations.
  • The method might extend to other forms of property-based testing where relations are partially specified.
  • Integration with test generation frameworks could automate more of the metamorphic testing process.

Load-bearing premise

LLM-generated additional input pairs are representative enough to produce transformations that generalize, and data-flow analysis removes only irrelevant code without losing key mapping logic.
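A toy illustration of why this premise matters: with a single example pair, several transformations are equally consistent, and only further pairs tell them apart. The candidates and pairs below are invented for the sketch, and the extra pairs merely stand in for what an LLM might propose.

```python
def consistent(candidates, pairs):
    """Keep candidates that map every source input to its follow-up input."""
    return [name for name, fn in candidates
            if all(fn(src) == fol for src, fol in pairs)]


# Hypothetical candidate transformations for an integer input.
candidates = [
    ("double", lambda x: 2 * x),
    ("add_two", lambda x: x + 2),
    ("square", lambda x: x * x),
]

one_pair = [(2, 4)]                         # the single hard-coded pair
more_pairs = one_pair + [(3, 6), (5, 10)]   # stand-ins for generated pairs

print(consistent(candidates, one_pair))     # ['double', 'add_two', 'square']
print(consistent(candidates, more_pairs))   # ['double']
```

The same mechanism cuts the other way: if the additional pairs do not reflect the intended relation, they can rule out the right transformation, which is why their representativeness is load-bearing.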

What would settle it

Applying MR-Adopt to a collection of hard-coded MR test cases from unseen projects and finding that fewer than half yield transformations applicable to all new source inputs.
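The check described above reduces to a simple rate. The sketch below assumes a hypothetical record format (one list of pass/fail flags per MR, one flag per new source input) and uses made-up outcomes, not results from the paper.

```python
def fully_applicable_rate(outcomes):
    """outcomes maps an MR id to a list of booleans, one per new source input,
    saying whether the deduced transformation produced a valid follow-up input.
    An MR counts as a success only if every entry is True."""
    if not outcomes:
        raise ValueError("no MRs to evaluate")
    successes = sum(1 for checks in outcomes.values() if checks and all(checks))
    return successes / len(outcomes)


# Made-up outcomes for four MRs from hypothetical unseen projects.
outcomes = {
    "mr_a": [True, True, True],
    "mr_b": [True, False, True],
    "mr_c": [True, True],
    "mr_d": [False, False, False],
}
rate = fully_applicable_rate(outcomes)
print(rate, rate < 0.5)
```

The all-or-nothing rule per MR mirrors the "applicable to all experimental source inputs" wording of the headline figure; a per-input pass rate would be a different and more lenient measure.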

Figures

Figures reproduced from arXiv: 2408.15815 by Congying Xu, Hengcheng Zhu, Jialun Cao, Jiarong Wu, Shing-Chi Cheung, Songqiang Chen, Valerio Terragni.

Figure 1: Overview of MR-Adopt for Metamorphic Testing
Figure 2: An overview of …
Original abstract

While a recent study reveals that many developer-written test cases can encode a reusable Metamorphic Relation (MR), over 70% of them directly hard-code the source input and follow-up input in the encoded relation. Such encoded MRs, which do not contain an explicit input transformation to transform the source inputs to corresponding follow-up inputs, cannot be reused with new source inputs to enhance test adequacy. In this paper, we propose MR-Adopt (Automatic Deduction Of inPut Transformation) to automatically deduce the input transformation from the hard-coded source and follow-up inputs, aiming to enable the encoded MRs to be reused with new source inputs. With typically only one pair of source and follow-up inputs available in an MR-encoded test case as the example, we leveraged LLMs to understand the intention of the test case and generate additional examples of source-followup input pairs. This helps to guide the generation of input transformations generalizable to multiple source inputs. Besides, to mitigate the issue that LLMs generate erroneous code, we refine LLM-generated transformations by removing MR-irrelevant code elements with data-flow analysis. Finally, we assess candidate transformations based on encoded output relations and select the best transformation as the result. Evaluation results show that MR-Adopt can generate input transformations applicable to all experimental source inputs for 72.00% of encoded MRs, which is 33.33% more than using vanilla GPT-3.5. By incorporating MR-Adopt-generated input transformations, encoded MR-based test cases can effectively enhance the test adequacy, increasing the line coverage and mutation score by 10.62% and 18.91%, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents MR-Adopt, an automated approach to deduce reusable input transformation functions from encoded Metamorphic Relations (MRs) that hard-code source and follow-up inputs. The method uses LLMs to generate additional source-followup input pairs from the single available example in a test case, applies data-flow analysis to refine the generated code by removing irrelevant elements, and selects the best transformation by evaluating it against the encoded output relations. Evaluation on encoded MRs shows that MR-Adopt produces transformations applicable to all experimental source inputs for 72.00% of cases (33.33% higher than vanilla GPT-3.5), and that incorporating these transformations increases line coverage by 10.62% and mutation score by 18.91%.

Significance. If the empirical results hold under rigorous controls, the work addresses a clear practical barrier to reusing encoded MRs in metamorphic testing, potentially allowing a large fraction of existing developer-written tests to be applied to new inputs. The hybrid use of LLMs for example generation combined with static data-flow refinement is a pragmatic strength, and the focus on generalizability from minimal examples aligns with real-world test maintenance needs. The reported gains in coverage and mutation score indicate measurable test-adequacy benefits.

major comments (1)
  1. [Evaluation] Evaluation section: the central claims of 72.00% applicability, 33.33% improvement over vanilla GPT-3.5, 10.62% coverage gain, and 18.91% mutation-score gain are presented without reporting the total number of encoded MRs examined, subject-program selection criteria, number of LLM runs, variance or statistical significance, or explicit controls for prompt sensitivity. These details are load-bearing for assessing whether the percentages reliably support the reusability and adequacy claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback on the evaluation section. We agree that additional methodological details are needed to support the reported results and will revise the manuscript to address this.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the central claims of 72.00% applicability, 33.33% improvement over vanilla GPT-3.5, 10.62% coverage gain, and 18.91% mutation-score gain are presented without reporting the total number of encoded MRs examined, subject-program selection criteria, number of LLM runs, variance or statistical significance, or explicit controls for prompt sensitivity. These details are load-bearing for assessing whether the percentages reliably support the reusability and adequacy claims.

    Authors: We agree that these details are essential for evaluating the reliability of the empirical claims. The current manuscript presents the aggregate percentages without explicitly stating the underlying experimental parameters. In the revised version, we will expand the Evaluation section (likely in a new subsection on experimental setup and threats to validity) to report: the total number of encoded MRs examined, the criteria and process for selecting subject programs, the number of LLM runs (including any repetition for stability), observed variance across runs if applicable, any statistical significance testing performed, and explicit controls or sensitivity analysis for prompt variations. This will allow readers to better assess the generalizability of the 72% applicability rate and the coverage/mutation improvements. revision: yes
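As an illustration of the kind of reporting at issue, here is a minimal sketch of summarizing success rates over repeated LLM runs. The per-run values are invented for the example and are not taken from the paper; `summarize_runs` is a hypothetical helper built on Python's standard `statistics` module.

```python
import statistics


def summarize_runs(success_rates):
    """Mean and sample standard deviation of per-run success rates."""
    if len(success_rates) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(success_rates), statistics.stdev(success_rates)


# Made-up per-run success rates for five repetitions of the same experiment.
runs = [0.50, 0.60, 0.55, 0.65, 0.45]
mean, stdev = summarize_runs(runs)
print(f"mean={mean:.3f} stdev={stdev:.3f}")
```

Reporting the spread alongside the mean lets a reader judge whether a gap between two configurations exceeds run-to-run noise.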

Circularity Check

0 steps flagged

No significant circularity

The paper presents a multi-stage heuristic pipeline (LLM-based pair generation, data-flow refinement, output-relation selection) whose success metrics (72% applicability, coverage/mutation gains) are measured empirically against external benchmarks and test suites. No derivation reduces by construction to fitted parameters, self-citations, or renamed inputs; the central claims rest on observable program behavior and LLM outputs rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution rests on the empirical capability of current LLMs to infer test intent from code and on the correctness of standard data-flow algorithms; no new mathematical constants or entities are introduced.

axioms (2)
  • domain assumption Large language models can produce additional valid source-followup input pairs that reflect the intended metamorphic relation when given only the test code.
    Invoked when the method asks the LLM to generate extra examples to guide transformation synthesis.
  • domain assumption Data-flow analysis can correctly identify and excise MR-irrelevant statements without removing logic required for a correct input mapping.
    Invoked in the refinement step that removes erroneous code elements produced by the LLM.
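The second assumption can be made tangible with a deliberately simple stand-in for data-flow refinement: a backward slice over a straight-line Python function that keeps only the assignments the returned value depends on. This is not the paper's analysis. The helper `prune_irrelevant` and the candidate function are hypothetical, the sketch assumes the function ends in a `return`, and it requires Python 3.9 or later for `ast.unparse`.

```python
import ast


def names_in(node):
    """All variable names appearing anywhere inside an AST node."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}


def prune_irrelevant(source):
    """Drop top-level simple assignments whose targets never flow into the
    returned value. Every other statement is kept conservatively."""
    tree = ast.parse(source)
    func = tree.body[0]
    needed, kept = set(), []
    for stmt in reversed(func.body):
        simple_assign = (isinstance(stmt, ast.Assign)
                         and all(isinstance(t, ast.Name) for t in stmt.targets))
        if simple_assign:
            targets = {t.id for t in stmt.targets}
            if not targets & needed:
                continue  # defines nothing the result depends on
            needed -= targets
            needed |= names_in(stmt.value)
        else:
            # Returns and anything this sketch does not model stay in.
            needed |= names_in(stmt)
        kept.append(stmt)
    func.body = kept[::-1]
    return ast.unparse(tree)


# Hypothetical LLM-generated candidate with two irrelevant statements.
candidate = '''
def transform(source):
    debug = len(source)
    scale = 2
    unused = scale * debug
    followup = [scale * v for v in source]
    return followup
'''
print(prune_irrelevant(candidate))
```

The sketch also shows where the assumption can fail: an assignment is dropped purely because its target is unused, so a right-hand side with side effects that the mapping relies on would be removed along with it.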
