
arxiv: 2404.01535 · v2 · submitted 2024-04-01 · 💻 cs.SE

Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation

Pith reviewed 2026-05-24 02:17 UTC · model grok-4.3

classification 💻 cs.SE
keywords syntactic robustness, LLM code generation, mathematical formulas, prompt pre-processing, robustness attacks, code generation failures, formula reduction

The pith

LLMs generate non-equivalent code for math formulas rewritten with different but equivalent syntax; a reduction pre-processor raises robustness from 54% to 74%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines syntactic robustness as the requirement that an LLM code generator must produce semantically equivalent outputs when a mathematical formula in the prompt is replaced by a different syntactic form that expresses the same meaning. Experiments show this property fails in over 45% of cases on average and drops further when attackers deliberately vary formula syntax. A pre-processing step that reduces formulas to a canonical simplified form before prompting improves the success rate to 74.42%. Readers should care because software requirements routinely embed mathematical specifications, and inconsistent code generation undermines reliable use of LLMs in development workflows.

Core claim

Syntactic robustness is formalized as the property that prompts containing mathematically equivalent formulas written with different syntax must produce semantically equivalent code. Assessment across LLMs reveals the property holds in only 54.05% of evaluated cases, with lower rates for prompts requiring mathematical reasoning. Syntactic attack strategies that alter formula presentation without changing meaning further reduce robustness. A pre-processing reduction step that transforms formulas into simplified equivalent forms raises measured syntactic robustness to 74.42%.

What carries the argument

Syntactic robustness: the invariance of generated code semantics under syntactic rewrites of embedded mathematical formulas that preserve their mathematical meaning.
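
One way to make this definition operational is sketched below. It is a minimal illustration, not the paper's checker: `behaviourally_equivalent` only approximates semantic equivalence by comparing two functions on random inputs (so it can miss rare differences), and `robustness_degree`, the sample formula, and the three stand-in "generated" functions are hypothetical names invented for this example.

```python
import random
from typing import Callable, Sequence

Func = Callable[[float], float]


def behaviourally_equivalent(f: Func, g: Func, trials: int = 200,
                             seed: int = 0, tol: float = 1e-9) -> bool:
    """Approximate semantic equivalence by differential testing on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.uniform(-100.0, 100.0)
        a, b = f(x), g(x)
        # A relative tolerance absorbs floating-point rounding differences.
        if abs(a - b) > tol * max(1.0, abs(a), abs(b)):
            return False
    return True


def robustness_degree(baseline: Func, variants: Sequence[Func]) -> float:
    """Fraction of variant outputs that behave like the baseline output."""
    hits = sum(behaviourally_equivalent(baseline, v) for v in variants)
    return hits / len(variants)


# Hypothetical stand-ins for code generated from three forms of y = x^2 - 1.
def from_original(x: float) -> float:   # prompt used "x^2 - 1"
    return x * x - 1


def from_factored(x: float) -> float:   # prompt used "(x - 1)(x + 1)"
    return (x - 1) * (x + 1)


def from_padded(x: float) -> float:     # prompt used "x^2 + 3 - 3 - 1", mistranslated
    return x * x + 3 - 1


degree = robustness_degree(from_original, [from_factored, from_padded])
print(degree)  # 0.5: one of the two variants preserves the behaviour
```

Random testing is only one possible oracle; a test suite or a symbolic check would slot into `behaviourally_equivalent` without changing the degree computation.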

If this is right

  • LLM-based code generation cannot be trusted to respect mathematical equivalence when formula syntax varies in the prompt.
  • Attackers can systematically degrade code output quality by choosing alternate but equivalent formula syntax.
  • A lightweight pre-processing reduction applied to formulas measurably increases the fraction of prompts that produce equivalent code.
  • The robustness gain applies most strongly to prompts that involve mathematical reasoning.
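
The pre-processing idea can be sketched for univariate polynomial equations with integer coefficients. This is an illustrative assumption about what a reduction might look like, not the paper's rule set: the term-list representation and the names `reduce_equation` and `render` are hypothetical, and the three steps (shift to the left-hand side, cancel redundant terms, divide out a common factor) are loosely modelled on the rule names visible in the figure text.

```python
from functools import reduce
from math import gcd

Term = tuple[int, int]   # (integer coefficient, exponent of x)
Poly = dict[int, int]    # exponent -> collected coefficient


def reduce_equation(lhs: list[Term], rhs: list[Term]) -> Poly:
    """Reduce `lhs = rhs` to a canonical `p(x) = 0` form."""
    coeffs: Poly = {}
    # Shift everything to the left-hand side: E1 = E2 becomes E1 - E2 = 0.
    for sign, side in ((1, lhs), (-1, rhs)):
        for coef, exp in side:
            # Collecting like terms cancels redundant pairs such as +Q - Q.
            coeffs[exp] = coeffs.get(exp, 0) + sign * coef
    coeffs = {e: c for e, c in coeffs.items() if c != 0}
    if not coeffs:
        return {}
    # Divide out a common integer factor: E * Q = 0 becomes E = 0 for Q != 0.
    factor = reduce(gcd, (abs(c) for c in coeffs.values()))
    return {e: c // factor for e, c in coeffs.items()}


def render(poly: Poly) -> str:
    """Render the reduced equation as prompt text, highest power first."""
    parts = [f"{c:+d}*x^{e}" for e, c in sorted(poly.items(), reverse=True)]
    return " ".join(parts) + " = 0" if parts else "0 = 0"


# Two syntactic forms of the same equation.
form_a = reduce_equation([(2, 2), (4, 1)], [(6, 0)])                      # 2x^2 + 4x = 6
form_b = reduce_equation([(2, 2), (5, 0), (4, 1), (-5, 0), (-6, 0)], [])  # 2x^2 + 5 + 4x - 5 - 6 = 0

print(render(form_a))  # +1*x^2 +2*x^1 -3*x^0 = 0
print(form_a == form_b)  # True
```

Because both forms reduce to the same text, the code generator would see a single prompt regardless of which form the user wrote. The sketch does not normalise the overall sign, so an equation and its negation still reduce to different text.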

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Current LLMs appear to treat surface syntax of formulas as semantically relevant rather than extracting only the underlying mathematical intent.
  • Prompt standardization via reduction may be a general technique worth testing on other structured elements such as logical expressions or data schemas.
  • The reported percentages depend on the coverage of the test cases; broader benchmarks could show larger or smaller gaps.

Load-bearing premise

Semantic equivalence between the original and modified-prompt code outputs can be reliably and automatically determined across the chosen test cases without false positives or negatives that would alter the reported percentages.

What would settle it

A manual audit of a random sample of generated code pairs, classified as equivalent or inequivalent by the paper's automated checker, that reveals a substantial mismatch rate with human judgment on semantic equivalence.
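
Such an audit could be summarised as a mismatch rate with a confidence interval. The sketch below assumes the checker and human verdicts are available as parallel lists of booleans; the labels are synthetic, the `audit` helper is hypothetical, and the Wilson score interval is one common choice rather than anything the paper prescribes.

```python
import math
import random


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z = 1.96 is about 95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def audit(checker: list[bool], human: list[bool], sample_size: int, seed: int = 0):
    """Sample judged pairs and measure how often checker and human disagree."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(checker)), sample_size)
    mismatches = sum(checker[i] != human[i] for i in idx)
    return mismatches / sample_size, wilson_interval(mismatches, sample_size)


# Synthetic labels for illustration only: 200 pairs, 10 disagreements.
checker_labels = [True] * 120 + [False] * 80
human_labels = list(checker_labels)
for i in range(0, 200, 20):
    human_labels[i] = not human_labels[i]

# Auditing all 200 pairs keeps this example deterministic; a real audit
# would normally pass a smaller sample_size.
rate, (low, high) = audit(checker_labels, human_labels, sample_size=200)
print(f"mismatch rate {rate:.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

What counts as a "substantial" mismatch rate is a judgment call; the interval at least shows how much of the observed rate could be sampling noise.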

Figures

Figures reproduced from arXiv: 2404.01535 by Achintya Desai, Laboni Sarker, Mara Downing, Tevfik Bultan.

Figure 1: Prompt Example 1 and the generated code by the LLM-based code generator.
Figure 2: Prompt Example 2 and the code generated by the LLM-based code generator.
Figure 3: Our context-free grammar for univariate polynomial,
Figure 4: Mutation rules for equations. Extracted rule text:
  R1 (Shift to L.H.S.): E1 = E2 ⇒ E1 − E2 = 0
  R2 (Removing redundant addition): E + Q − Q ⇒ E; E1 + Q + E2 − Q = 0 ⇒ E1 + E2 = 0; E1 + Q − E2 − Q = 0 ⇒ E1 − E2 = 0
  R3 (Removing redundant subtraction): E − Q + Q ⇒ E; E1 − Q + E2 + Q = 0 ⇒ E1 + E2 = 0; E1 − Q − E2 + Q = 0 ⇒ E1 − E2 = 0
  R4 (Removing redundant mu…): E × Q = 0 ⇒ E = 0; E1 × Q + E2 × Q = 0 ⇒ E1 + E2 = 0; E1 × Q − E2 × Q = 0 ⇒ E1 − E2 = 0
Figure 5: Reduction rules for equations. Diagram labels: Prompt, Formula Reduction, Reduced Prompt, LLM-based Code Generator, Generated Code.
Figure 7: Syntactic robustness checking (this workflow is applied in a loop for multiple mutations).
Figure 9: Syntactic Robustness Degree Vs. Mutation types
Figure 10: GPT-3.5: Syntactic robustness degree for equations
Figure 11: GPT-4: Syntactic robustness degree for equations
Figure 12: Syntactic Robustness Degree Vs. Equation types
Original abstract

Rapid advances in the field of Large Language Models (LLMs) have made LLM-based code generation an important area for investigation. An LLM-based code generator takes a prompt as input and produces code that implements the requirements specified in the prompt. Many software requirements include mathematical formulas that specify the expected behavior of the code to be generated. Given a code generation prompt that contains a mathematical formula, a reasonable expectation is that, if the formula is syntactically modified without changing its semantics, the generated code for the modified prompt should be semantically equivalent. We formalize this concept as syntactic robustness and investigate the syntactic robustness of LLMs as code generators. Our experimental assessment demonstrates that LLMs are not syntactically robust for code generation prompts with formulas, especially for the ones that require mathematical reasoning. We investigate attack strategies that can further deteriorate the syntactic robustness of LLMs. Finally, to mitigate syntactic robustness failures in LLMs, we propose a pre-processing step that uses reductions to transform formulas in prompts to a simplified form. Our experimental results demonstrate that the syntactic robustness of LLM-based code generation improves significantly using our approach, improving syntactic robustness of LLMs from 54.05% to 74.42%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper formalizes syntactic robustness for LLM-based code generation on prompts containing mathematical formulas, empirically demonstrates that LLMs frequently produce semantically inequivalent code under syntactic but semantically equivalent formula changes (baseline 54.05%), shows that targeted attacks can further degrade performance, and proposes a pre-processing mitigation that applies reduction rules to simplify formulas, raising measured robustness to 74.42%.

Significance. If the semantic-equivalence measurements are reliable, the work identifies a concrete, practically relevant failure mode in LLM code generators for mathematically specified requirements and supplies an inexpensive mitigation that yields a substantial measured gain. The empirical framing and the reduction-based defense are strengths that could inform more robust prompt engineering in scientific and engineering code-generation settings.

major comments (1)
  1. [Experimental results / evaluation sections] The headline result (54.05 % → 74.42 %) is obtained by counting cases where code generated from a syntactically altered but semantically identical formula prompt is judged semantically equivalent to the baseline output. No section describes the automated equivalence oracle (test-suite execution, symbolic execution, or other), its validation against human judgment or formal methods, or its false-positive/false-negative rate on the mathematical-formula subset. Because even modest oracle error on a few hundred instances could erase or reverse the reported 20-point gain, this measurement procedure is load-bearing for the central claim.
minor comments (1)
  1. [Abstract] The abstract states concrete percentages without naming the LLMs, prompt corpus size, or statistical tests; a one-sentence summary of the experimental protocol would improve readability.
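
The sensitivity the referee describes can be explored with a standard misclassification correction. The sketch below assumes the observed rate relates to the true rate t by observed = t(1 − fn) + (1 − t)·fp, where fp is the share of inequivalent pairs judged equivalent and fn the share of equivalent pairs judged inequivalent. The two reported rates come from the paper; every error rate in `scenarios` is hypothetical.

```python
def corrected_rate(observed: float, fp: float, fn: float) -> float:
    """Invert observed = t*(1 - fn) + (1 - t)*fp for the true rate t."""
    if fp + fn >= 1:
        raise ValueError("error rates must sum to less than 1")
    t = (observed - fp) / (1 - fn - fp)
    return min(1.0, max(0.0, t))


BASELINE, MITIGATED = 0.5405, 0.7442  # rates reported in the paper

# Hypothetical oracle error rates (fp, fn) for the baseline and mitigated runs.
scenarios = {
    "perfect oracle": ((0.0, 0.0), (0.0, 0.0)),
    "same errors in both runs": ((0.05, 0.05), (0.05, 0.05)),
    "errors differ between runs": ((0.0, 0.20), (0.10, 0.0)),
}

gains = {}
for name, (base_err, mit_err) in scenarios.items():
    gains[name] = (corrected_rate(MITIGATED, *mit_err)
                   - corrected_rate(BASELINE, *base_err))
    print(f"{name}: corrected gain = {gains[name]:.3f}")
```

Under this model, identical error rates in both runs rescale the gain rather than erase it; the gain shrinks when the oracle errs differently on baseline and mitigated outputs, which is why per-condition error estimates matter. Sampling noise is a separate effect that this sketch does not model.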

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our evaluation methodology. The concern about the semantic equivalence oracle is well-taken and directly impacts the interpretability of our headline results. We address it point-by-point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: The headline result (54.05 % → 74.42 %) is obtained by counting cases where code generated from a syntactically altered but semantically identical formula prompt is judged semantically equivalent to the baseline output. No section describes the automated equivalence oracle (test-suite execution, symbolic execution, or other), its validation against human judgment or formal methods, or its false-positive/false-negative rate on the mathematical-formula subset. Because even modest oracle error on a few hundred instances could erase or reverse the reported 20-point gain, this measurement procedure is load-bearing for the central claim.

    Authors: We agree that the manuscript currently lacks a dedicated description of the automated equivalence oracle, its construction, validation, and error characteristics. This omission weakens the transparency of the central claim. In the revised manuscript we will insert a new subsection (tentatively 4.3) under Experimental Setup that (1) specifies the oracle as test-suite execution against problem-specific unit tests, (2) details how the test suites were derived from the original problem statements and manually verified for coverage, (3) reports the results of a human validation study on a random sample of 100 equivalence judgments (including inter-rater agreement), and (4) provides empirical false-positive and false-negative estimates obtained from that validation. These additions will allow readers to assess the reliability of the 20-point improvement directly.

    revision: yes
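
Inter-rater agreement of the kind promised in item (3) is often reported as Cohen's kappa, which discounts the agreement two raters would reach by chance. The sketch below is an assumption about how such a figure could be computed; the rebuttal does not say which statistic would be used, and the two label lists are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts: True means "the two programs are semantically equivalent".
rater_1 = [True, True, False, False]
rater_2 = [True, False, False, False]

kappa = cohens_kappa(rater_1, rater_2)
print(kappa)  # 0.5
```

Here the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.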

Circularity Check

0 steps flagged

No circularity: empirical measurements of LLM robustness

Full rationale

The paper reports direct experimental results measuring syntactic robustness of LLMs on code-generation prompts containing formulas, both before and after applying a proposed pre-processing reduction step. The headline percentages (54.05% to 74.42%) are computed from counts of test cases where generated code is judged semantically equivalent; these are external benchmarks against a baseline rather than quantities derived from fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed improvement to the inputs by construction. The evaluation is self-contained and falsifiable via replication on the same prompts.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the experimental construction of syntactic variants, the assumption that reductions preserve semantics, and the choice of evaluation models and test cases; these are not supplied by prior literature and function as study-specific parameters.

free parameters (2)
  • LLM models and prompt corpus
    Specific models and the set of mathematical formulas used to compute the 54.05% and 74.42% figures are selected by the authors.
  • Reduction rules applied
    The particular mathematical reductions chosen for the pre-processing step are defined and selected within the paper.
axioms (2)
  • domain assumption: Syntactic variants of formulas preserve original semantics
    Invoked in the definition of syntactic robustness and in the claim that modified prompts should produce equivalent code.
  • domain assumption: Semantic equivalence of generated code can be automatically verified
    Required to turn observed code outputs into the reported robustness percentages.
invented entity (1)
  • syntactic robustness metric (no independent evidence)
    purpose: Quantifies consistency of LLM code output under semantic-preserving syntactic changes to formulas
    Newly defined concept introduced to frame the experiments and mitigation.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code

    cs.SE · 2026-05 · accept · novelty 6.0

    A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
