Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation

Achintya Desai; Laboni Sarker; Mara Downing; Tevfik Bultan

arxiv: 2404.01535 · v2 · submitted 2024-04-01 · 💻 cs.SE

Pith reviewed 2026-05-24 02:17 UTC · model grok-4.3

classification 💻 cs.SE

keywords: syntactic robustness, LLM code generation, mathematical formulas, prompt pre-processing, robustness attacks, code generation failures, formula reduction

The pith

LLMs often generate non-equivalent code when a math formula in the prompt is rewritten in different but equivalent syntax; a reduction pre-processor raises measured robustness from 54.05% to 74.42%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines syntactic robustness as the requirement that an LLM code generator must produce semantically equivalent outputs when a mathematical formula in the prompt is replaced by a different syntactic form that expresses the same meaning. Experiments show this property fails in over 45% of cases on average and drops further when attackers deliberately vary formula syntax. A pre-processing step that reduces formulas to a canonical simplified form before prompting improves the success rate to 74.42%. Readers should care because software requirements routinely embed mathematical specifications, and inconsistent code generation undermines reliable use of LLMs in development workflows.

Core claim

Syntactic robustness is formalized as the property that prompts containing mathematically equivalent formulas written with different syntax must produce semantically equivalent code. Assessment across LLMs reveals the property holds in only 54.05% of evaluated cases, with lower rates for prompts requiring mathematical reasoning. Syntactic attack strategies that alter formula presentation without changing meaning further reduce robustness. A pre-processing reduction step that transforms formulas into simplified equivalent forms raises measured syntactic robustness to 74.42%.

What carries the argument

Syntactic robustness: the invariance of generated code semantics under syntactic rewrites of embedded mathematical formulas that preserve their mathematical meaning.
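
As a minimal sketch of how this invariance could be measured, the snippet below compares a baseline program against programs produced from rewritten prompts by running them on sampled inputs. The function names, the sampling-based equivalence test, and the stand-in programs are all assumptions made for illustration; the source does not describe the paper's actual checker, and agreement on samples can refute equivalence but never prove it.

```python
from typing import Callable, Sequence

Numeric = Callable[[float], float]


def equivalent_on_samples(f: Numeric, g: Numeric,
                          inputs: Sequence[float], tol: float = 1e-9) -> bool:
    """Return True if f and g agree (within a relative tolerance) on every input.

    This is a hypothetical stand-in for an equivalence oracle: a single
    disagreement refutes equivalence, but agreement is only evidence.
    """
    for x in inputs:
        fx, gx = f(x), g(x)
        if abs(fx - gx) > tol * max(1.0, abs(fx), abs(gx)):
            return False
    return True


def robustness_degree(baseline: Numeric, variants: Sequence[Numeric],
                      inputs: Sequence[float]) -> float:
    """Fraction of variant programs that match the baseline on the samples."""
    if not variants:
        raise ValueError("need at least one variant program")
    matches = sum(equivalent_on_samples(baseline, v, inputs) for v in variants)
    return matches / len(variants)


# Hypothetical stand-ins for generated code; each is meant to evaluate x^2 - 4.
def from_original_prompt(x: float) -> float:
    return x * x - 4


def from_factored_prompt(x: float) -> float:
    return (x - 2) * (x + 2)  # different syntax, same meaning


def from_mutated_prompt(x: float) -> float:
    return x * x + 4  # sign error: not equivalent to the baseline


samples = [-3.0, -1.0, 0.0, 0.5, 2.0, 7.0]
degree = robustness_degree(from_original_prompt,
                           [from_factored_prompt, from_mutated_prompt],
                           samples)
print(degree)  # 0.5
```

One of the two variants matches the baseline, so the sketch reports a degree of 0.5. A percentage such as 54.05% would be this kind of fraction aggregated over many prompts and rewrites, under whatever equivalence procedure the paper actually uses.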

If this is right

LLM-based code generation cannot be trusted to respect mathematical equivalence when formula syntax varies in the prompt.
Attackers can systematically degrade code output quality by choosing alternate but equivalent formula syntax.
A lightweight pre-processing reduction applied to formulas measurably increases the fraction of prompts that produce equivalent code.
The robustness gain applies most strongly to prompts that involve mathematical reasoning.
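
The pre-processing idea in the list above can be sketched for univariate polynomial equations. This is an illustration under stated assumptions, not the paper's implementation: the term-list representation and the `reduce_equation` helper are hypothetical, and the sketch covers only shifting everything to the left-hand side and cancelling redundant additions and subtractions.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Dict, Iterable, Tuple

# Hypothetical representation: one side of an equation is a list of
# (coefficient, exponent) terms in a single variable x.
Term = Tuple[int, int]


def reduce_equation(lhs: Iterable[Term], rhs: Iterable[Term]) -> Dict[int, Fraction]:
    """Return a canonical {exponent: coefficient} map for lhs - rhs = 0.

    Moving the right-hand side to the left and merging like terms cancels
    any +Q ... -Q pairs, so equivalent rewrites map to the same result.
    """
    coeffs: Dict[int, Fraction] = defaultdict(Fraction)
    for coefficient, exponent in lhs:
        coeffs[exponent] += Fraction(coefficient)
    for coefficient, exponent in rhs:
        coeffs[exponent] -= Fraction(coefficient)
    # Drop zero coefficients and order by descending exponent.
    return {e: c for e, c in sorted(coeffs.items(), reverse=True) if c != 0}


# Three syntactic forms of the same equation, x^2 + 3x = 10:
plain = reduce_equation([(1, 2), (3, 1)], [(10, 0)])
padded = reduce_equation([(1, 2), (3, 1), (5, 1), (-5, 1)], [(10, 0)])  # + 5x - 5x
shifted = reduce_equation([(1, 2)], [(10, 0), (-3, 1)])                 # x^2 = 10 - 3x

print(plain == padded == shifted)  # True
```

All three inputs reduce to the coefficients of x^2 + 3x - 10 = 0, so a prompt built from the reduced form would be identical regardless of which surface syntax the requirement originally used.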

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Current LLMs appear to treat surface syntax of formulas as semantically relevant rather than extracting only the underlying mathematical intent.
Prompt standardization via reduction may be a general technique worth testing on other structured elements such as logical expressions or data schemas.
The reported percentages depend on the coverage of the test cases; broader benchmarks could show larger or smaller gaps.

Load-bearing premise

Semantic equivalence between the original and modified-prompt code outputs can be reliably and automatically determined across the chosen test cases without false positives or negatives that would alter the reported percentages.

What would settle it

A manual audit of a random sample of generated code pairs, classified as equivalent or inequivalent by the paper's automated checker, that reveals a substantial mismatch rate with human judgment on semantic equivalence.
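
A rough sketch of how such an audit could be tallied is shown below. The `audit` function, the label encoding (True means "equivalent"), and the example data are hypothetical; nothing here comes from the paper.

```python
import random
from typing import Dict, List, Tuple

# Each pair is (oracle_says_equivalent, human_says_equivalent).
Judgment = Tuple[bool, bool]


def audit(pairs: List[Judgment], sample_size: int, seed: int = 0) -> Dict[str, float]:
    """Sample judgments at random and count oracle/human disagreements."""
    if not pairs or sample_size <= 0:
        raise ValueError("need a non-empty population and a positive sample size")
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    sample = rng.sample(pairs, min(sample_size, len(pairs)))
    false_pos = sum(1 for oracle, human in sample if oracle and not human)
    false_neg = sum(1 for oracle, human in sample if not oracle and human)
    n = len(sample)
    return {
        "n": n,
        "false_pos": false_pos,   # oracle accepted code a human rejects
        "false_neg": false_neg,   # oracle rejected code a human accepts
        "mismatch_rate": (false_pos + false_neg) / n,
    }


# Made-up labels for ten code pairs, purely to exercise the function.
labels = ([(True, True)] * 6 + [(True, False)] + [(False, True)]
          + [(False, False)] * 2)
report = audit(labels, sample_size=10)
print(report["mismatch_rate"])  # 0.2
```

Reporting false positives and false negatives separately matters because they push a measured robustness rate in opposite directions.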

Figures

Figures reproduced from arXiv: 2404.01535 by Achintya Desai, Laboni Sarker, Mara Downing, Tevfik Bultan.

**Figure 2.** Prompt Example 2 and the code generated by the LLM-based code generator.

**Figure 3.** Our context-free grammar for univariate polynomial,

**Figure 4.** Mutation rules for equations. Extracted rules (truncated):
R1 (Shift to L.H.S.): E1 = E2 → E1 − E2 = 0
R2 (Removing redundant addition): E + Q − Q → E; E1 + Q + E2 − Q = 0 → E1 + E2 = 0; E1 + Q − E2 − Q = 0 → E1 − E2 = 0
R3 (Removing redundant subtraction): E − Q + Q → E; E1 − Q + E2 + Q = 0 → E1 + E2 = 0; E1 − Q − E2 + Q = 0 → E1 − E2 = 0
R4 (Removing redundant mu…): E × Q = 0 → E = 0; E1 × Q + E2 × Q = 0 → E1 + E2 = 0; E1 × Q − E2 × Q = 0 → E1 − E2 = 0

**Figure 5.** Reduction rules for equations. Diagram labels: Prompt, Formula Reduction, Reduced Prompt, LLM-based Code Generator, Generated Code.

**Figure 7.** Syntactic robustness checking (this workflow is applied in a loop for multiple mutations).

**Figure 9.** Syntactic robustness degree vs. mutation types.

**Figure 10.** GPT-3.5: Syntactic robustness degree for equations.

**Figure 11.** GPT-4: Syntactic robustness degree for equations.

**Figure 12.** Syntactic robustness degree vs. equation types.

Original abstract

Rapid advances in the field of Large Language Models (LLMs) have made LLM-based code generation an important area for investigation. An LLM-based code generator takes a prompt as input and produces code that implements the requirements specified in the prompt. Many software requirements include mathematical formulas that specify the expected behavior of the code to be generated. Given a code generation prompt that contains a mathematical formula, a reasonable expectation is that, if the formula is syntactically modified without changing its semantics, the generated code for the modified prompt should be semantically equivalent. We formalize this concept as syntactic robustness and investigate the syntactic robustness of LLMs as code generators. Our experimental assessment demonstrates that LLMs are not syntactically robust for code generation prompts with formulas, especially for the ones that require mathematical reasoning. We investigate attack strategies that can further deteriorate the syntactic robustness of LLMs. Finally, to mitigate syntactic robustness failures in LLMs, we propose a pre-processing step that uses reductions to transform formulas in prompts to a simplified form. Our experimental results demonstrate that the syntactic robustness of LLM-based code generation improves significantly using our approach, improving syntactic robustness of LLMs from 54.05% to 74.42%.
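
To make "syntactically modified without changing its semantics" concrete, here is a small sketch of two meaning-preserving rewrites of a polynomial equation. The term-list representation and both mutation helpers are hypothetical; they are chosen as simple inverses of the redundant-term and shift rules listed with Figure 4, and the paper's own mutation rules may differ.

```python
from fractions import Fraction
from typing import List, Tuple

# Hypothetical representation: an equation is (lhs_terms, rhs_terms), where
# each term is a (coefficient, exponent) pair in a single variable x.
Term = Tuple[int, int]
Equation = Tuple[List[Term], List[Term]]


def add_redundant_term(eq: Equation, q: Term) -> Equation:
    """Append +Q and -Q to the left-hand side; the meaning is unchanged."""
    lhs, rhs = eq
    coefficient, exponent = q
    return (lhs + [(coefficient, exponent), (-coefficient, exponent)], list(rhs))


def move_last_term_right(eq: Equation) -> Equation:
    """Move the last left-hand term across the equals sign, flipping its sign."""
    lhs, rhs = eq
    if not lhs:
        return (list(lhs), list(rhs))
    coefficient, exponent = lhs[-1]
    return (lhs[:-1], rhs + [(-coefficient, exponent)])


def residual(eq: Equation, x: int) -> Fraction:
    """Evaluate lhs - rhs at x; equivalent equations give equal residuals."""
    lhs, rhs = eq

    def side(terms: List[Term]) -> Fraction:
        return sum((Fraction(c) * Fraction(x) ** e for c, e in terms), Fraction(0))

    return side(lhs) - side(rhs)


original: Equation = ([(1, 2), (3, 1), (-10, 0)], [])   # x^2 + 3x - 10 = 0
padded = add_redundant_term(original, (5, 1))           # ... + 5x - 5x = 0
shifted = move_last_term_right(original)                # x^2 + 3x = 10

points = [-5, 0, 2, 7]
print(all(residual(original, x) == residual(padded, x) == residual(shifted, x)
          for x in points))  # True
```

The three equations look different in a prompt but share the same solutions, which is exactly the situation in which a syntactically robust generator should emit equivalent code.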

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

The paper formalizes syntactic robustness for LLM code generation on math-formula prompts and reports a reduction-based mitigation lifting results from 54% to 74%, but the gain rests on an unvalidated semantic-equivalence procedure.

The key point is that LLMs often fail to produce semantically equivalent code when a prompt's math formula is rewritten in a different but equivalent syntax, and the authors give this failure a name plus a simple pre-processing fix. They define syntactic robustness, test several models on formula-containing prompts, try attacks that make the problem worse, and show that reducing formulas to a canonical form before generation improves the metric by about 20 points. That mitigation is the most concrete part of the work and targets a real, narrow pain point in code generation for requirements that include equations. The formalization itself is a modest but clean extension of existing robustness ideas to this setting.

The soft spot is the load-bearing premise flagged above. The headline numbers depend on automatically deciding whether two generated programs are semantically equivalent. The abstract supplies no description of the oracle, no validation against human judgment or formal methods, and no error-rate numbers. If that checker is off by even 10-15% on the test cases, the reported improvement becomes hard to distinguish from noise. Without the methods section it is impossible to tell whether the 74% figure is solid or fragile.

This is the kind of paper that belongs in a reading group focused on LLM reliability for software engineering. Readers working on robustness or on code generation with formal specs would find the definition and the mitigation idea useful to discuss, even if the experiments need more scrutiny. It is worth sending to referees so the evaluation details can be checked properly rather than desk-rejecting it outright.

Referee Report

1 major / 1 minor

Summary. The paper formalizes syntactic robustness for LLM-based code generation on prompts containing mathematical formulas, empirically demonstrates that LLMs frequently produce semantically inequivalent code under syntactic but semantically equivalent formula changes (baseline 54.05%), shows that targeted attacks can further degrade performance, and proposes a pre-processing mitigation that applies reduction rules to simplify formulas, raising measured robustness to 74.42%.

Significance. If the semantic-equivalence measurements are reliable, the work identifies a concrete, practically relevant failure mode in LLM code generators for mathematically specified requirements and supplies an inexpensive mitigation that yields a substantial measured gain. The empirical framing and the reduction-based defense are strengths that could inform more robust prompt engineering in scientific and engineering code-generation settings.

major comments (1)

[Experimental results / evaluation sections] The headline result (54.05 % → 74.42 %) is obtained by counting cases where code generated from a syntactically altered but semantically identical formula prompt is judged semantically equivalent to the baseline output. No section describes the automated equivalence oracle (test-suite execution, symbolic execution, or other), its validation against human judgment or formal methods, or its false-positive/false-negative rate on the mathematical-formula subset. Because even modest oracle error on a few hundred instances could erase or reverse the reported 20-point gain, this measurement procedure is load-bearing for the central claim.
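
The sensitivity described in this comment can be illustrated with a simple error model: if a fraction `true_rate` of code pairs is truly equivalent, an oracle with false-positive rate `fp_rate` and false-negative rate `fn_rate` reports `true_rate * (1 - fn_rate) + (1 - true_rate) * fp_rate`. The sketch below uses made-up numbers only; it is not an estimate of the paper's oracle, and the function name is hypothetical.

```python
def measured_rate(true_rate: float, fp_rate: float, fn_rate: float) -> float:
    """Robustness rate an imperfect oracle would report.

    fp_rate: chance an inequivalent pair is labelled equivalent.
    fn_rate: chance an equivalent pair is labelled inequivalent.
    """
    for name, value in (("true_rate", true_rate),
                        ("fp_rate", fp_rate),
                        ("fn_rate", fn_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return true_rate * (1.0 - fn_rate) + (1.0 - true_rate) * fp_rate


# Scenario 1 (hypothetical): same oracle errors before and after mitigation.
# A true 20-point gain is reported as (1 - fp - fn) times its real size.
same_gap = measured_rate(0.70, 0.10, 0.10) - measured_rate(0.50, 0.10, 0.10)
print(round(same_gap, 4))  # 0.16

# Scenario 2 (hypothetical): no true change at all, but the oracle behaves
# differently on the two sets of generated programs.
drift_gap = measured_rate(0.60, 0.35, 0.00) - measured_rate(0.60, 0.00, 0.10)
print(round(drift_gap, 4))  # 0.2
```

Under this model, a shared error rate only shrinks a real gain, while error rates that differ between conditions can shift the measured gap by an amount comparable to the reported improvement. That is why per-condition false-positive and false-negative estimates are the relevant evidence.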

minor comments (1)

[Abstract] The abstract states concrete percentages without naming the LLMs, prompt corpus size, or statistical tests; a one-sentence summary of the experimental protocol would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback on our evaluation methodology. The concern about the semantic equivalence oracle is well-taken and directly impacts the interpretability of our headline results. We address it point-by-point below and will revise the manuscript accordingly.

Referee: The headline result (54.05 % → 74.42 %) is obtained by counting cases where code generated from a syntactically altered but semantically identical formula prompt is judged semantically equivalent to the baseline output. No section describes the automated equivalence oracle (test-suite execution, symbolic execution, or other), its validation against human judgment or formal methods, or its false-positive/false-negative rate on the mathematical-formula subset. Because even modest oracle error on a few hundred instances could erase or reverse the reported 20-point gain, this measurement procedure is load-bearing for the central claim.

Authors: We agree that the manuscript currently lacks a dedicated description of the automated equivalence oracle, its construction, validation, and error characteristics. This omission weakens the transparency of the central claim. In the revised manuscript we will insert a new subsection (tentatively 4.3) under Experimental Setup that (1) specifies the oracle as test-suite execution against problem-specific unit tests, (2) details how the test suites were derived from the original problem statements and manually verified for coverage, (3) reports the results of a human validation study on a random sample of 100 equivalence judgments (including inter-rater agreement), and (4) provides empirical false-positive and false-negative estimates obtained from that validation. These additions will allow readers to assess the reliability of the 20-point improvement directly.

Revision: yes
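
The rebuttal mentions inter-rater agreement without naming a statistic. One common choice for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes that choice; the function and the ten example labels are illustrative and not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(rater_a) | set(rater_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up labels: 1 = "equivalent", 0 = "not equivalent".
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.6
```

Here the raters agree on 8 of 10 items, chance agreement is 0.5, and kappa comes out at 0.6. Raw agreement alone would overstate reliability when one label dominates the sample.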

Circularity Check

0 steps flagged

No circularity: empirical measurements of LLM robustness

The paper reports direct experimental results measuring syntactic robustness of LLMs on code-generation prompts containing formulas, both before and after applying a proposed pre-processing reduction step. The headline percentages (54.05% to 74.42%) are computed from counts of test cases where generated code is judged semantically equivalent; these are external benchmarks against a baseline rather than quantities derived from fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed improvement to the inputs by construction. The evaluation is self-contained and falsifiable via replication on the same prompts.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the experimental construction of syntactic variants, the assumption that reductions preserve semantics, and the choice of evaluation models and test cases; these are not supplied by prior literature and function as study-specific parameters.

free parameters (2)

LLMs and prompt corpus: The specific models and the set of mathematical formulas used to compute the 54.05% and 74.42% figures are selected by the authors.
Reduction rules applied: The particular mathematical reductions chosen for the pre-processing step are defined and selected within the paper.

axioms (2)

Domain assumption: Syntactic variants of formulas preserve the original semantics. Invoked in the definition of syntactic robustness and in the claim that modified prompts should produce equivalent code.
Domain assumption: Semantic equivalence of generated code can be automatically verified. Required to turn observed code outputs into the reported robustness percentages.

invented entities (1)

Syntactic robustness metric (no independent evidence)
Purpose: Quantifies consistency of LLM code output under semantics-preserving syntactic changes to formulas.
A newly defined concept introduced to frame the experiments and the mitigation.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.

[1] [1]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui, “Agentcoder: Multi- agent-based code generation with iterative testing and optimisation,”arXiv preprint arXiv:2312.13010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

An empirical study of the code generation of safety-critical software using llms,

M. Liu, J. Wang, T. Lin, Q. Ma, Z. Fang, and Y . Wu, “An empirical study of the code generation of safety-critical software using llms,” Applied Sciences, vol. 14, no. 3, p. 1046, 2024

work page 2024

[3] [3]

Exploring early adopters’ perceptions of chatgpt as a code generation tool,

G. L. Scoccia, “Exploring early adopters’ perceptions of chatgpt as a code generation tool,” in 2023 38th IEEE/ACM International Conference on Automated Software Engineering Workshops (ASEW) . IEEE, 2023, pp. 88–93

work page 2023

[4] [4]

Ai2: Safety and robustness certification of neural networks with abstract interpretation,

T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev, “Ai2: Safety and robustness certification of neural networks with abstract interpretation,” in 2018 IEEE Symposium on Security and Privacy (SP). IEEE, 2018, pp. 3–18

work page 2018

[5] [5]

Chatgpt for programming numerical methods,

A. Kashefi and T. Mukerji, “Chatgpt for programming numerical methods,” Journal of Machine Learning for Modeling and Computing , vol. 4, no. 2, 2023

work page 2023

[6] [6]

Large language models for software engineering: Survey and open problems,

A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, and J. M. Zhang, “Large language models for software engineering: Survey and open problems,” arXiv preprint arXiv:2310.03533 , 2023

work page arXiv 2023

[7] [7]

“gcc,” https://gcc.gnu.org/, accessed: 2024-03-22

work page 2024

[8] [8]

“sympy,” https://www.sympy.org/en/index.html, accessed: 2024-03-22

work page 2024

[9] [9]

Gpt-4 technical report,

OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, and et al., “Gpt-4 technical report,” 2024

work page 2024

[10] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

[11] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition-level code generation with alphacode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022

[12] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” arXiv preprint arXiv:2203.13474, 2022

[13] C. Liu, B. Xuanlin, H. Zhang, N. Zhang, H. Hu, X. Zhang, and M. Yan, “Improving chatgpt prompt for code generation,” 05 2023

[14] S. Ouyang, J. Zhang, M. Harman, and M. Wang, “Llm is like a box of chocolates: the non-determinism of chatgpt in code generation,” 08 2023

[15] A. Buscemi, “A comparative study of code generation using chatgpt 3.5 across 10 programming languages,” 08 2023

[16] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, 2022, pp. 1–10

[17] E. Jiang, E. Toh, A. Molina, K. Olson, C. Kayacik, A. Donsbach, C. J. Cai, and M. Terry, “Discovering the syntax and strategies of natural language programming with generative language models,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–19

[18] T. Ahmed, K. S. Pai, P. Devanbu, and E. T. Barr, “Improving few-shot prompts with relevant static analysis products,” arXiv preprint arXiv:2304.06815, 2023

[19] B. Yetiştiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt,” 04 2023

[20] J. Liu, C. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” 05 2023

[21] J. Li, Y. Zhao, Y. Li, G. Li, and Z. Jin, “Towards enhancing in-context learning for code generation,” arXiv preprint arXiv:2303.17780, 2023

[22] J.-B. Döderlein, M. Acher, D. E. Khelladi, and B. Combemale, “Piloting copilot and codex: Hot temperature, cold prompts, or black magic?” arXiv preprint arXiv:2210.14699, 2022

[23] J. He and M. Vechev, “Controlling large language models to generate secure and vulnerable code,” arXiv e-prints, pp. arXiv–2302, 2023

[24] J. White, S. Hays, Q. Fu, J. Spencer-Smith, and D. C. Schmidt, “Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design,” arXiv preprint arXiv:2303.07839, 2023

[25] J. Li, Y. Li, G. Li, Z. Jin, Y. Hao, and X. Hu, “Skcoder: A sketch-based approach for automatic code generation,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 2124–2135

[26] J. Li, G. Li, Y. Li, and Z. Jin, “Enabling programming thinking in large language models toward code generation,” arXiv preprint arXiv:2305.06599, 2023

[27] S. Jiang, Y. Wang, and Y. Wang, “Selfevolve: A code evolution framework via large language models,” arXiv preprint arXiv:2306.02907, 2023

[28] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023

[29] B. Yetiştiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt,” arXiv preprint arXiv:2304.10778, 2023

[30] A. Borji, “A categorical archive of chatgpt failures,” arXiv preprint arXiv:2302.03494, 2023

[31] T. Dinh, J. Zhao, S. Tan, R. Negrinho, L. Lausen, S. Zha, and G. Karypis, “Large language models of code fail at completing code with potential bugs,” Advances in Neural Information Processing Systems, vol. 36, 2024

[32] M. Yan, J. Chen, J. M. Zhang, X. Cao, C. Yang, and M. Harman, “Coco: Testing code generation systems via concretized instructions,” arXiv preprint arXiv:2308.13319, 2023

[33] A. Mastropaolo, L. Pascarella, E. Guglielmi, M. Ciniselli, S. Scalabrino, R. Oliveto, and G. Bavota, “On the robustness of code generation techniques: An empirical study on github copilot,” 02 2023

[34] G. Katz, D. A. Huang, D. Ibeling, K. Julian, C. Lazarus, R. Lim, P. Shah, S. Thakoor, H. Wu, A. Zeljić et al., “The marabou framework for verification and analysis of deep neural networks,” in International Conference on Computer Aided Verification. Springer, 2019, pp. 443–452

[35] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient smt solver for verifying deep neural networks,” in International Conference on Computer Aided Verification. Springer, 2017, pp. 97–117

[36] R. Bunel, I. Turkaslan, P. H. Torr, P. Kohli, and M. P. Kumar, “Piecewise linear neural networks verification: A comparative study,” 2018

[37] R. Bunel, P. Mudigonda, I. Turkaslan, P. Torr, J. Lu, and P. Kohli, “Branch and bound for piecewise linear neural network verification,” Journal of Machine Learning Research, vol. 21, no. 2020, 2020

[38] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening, “Concolic testing for deep neural networks,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 109–119

[39] W. Lin, Z. Yang, X. Chen, Q. Zhao, X. Li, Z. Liu, and J. He, “Robustness verification of classification deep neural networks via linear programming,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11418–11427

[40] G. Singh, T. Gehr, M. Mirman, M. Püschel, and M. T. Vechev, “Fast and effective robustness certification,” NeurIPS, vol. 1, no. 4, p. 6, 2018

[41] G. Singh, T. Gehr, M. Püschel, and M. Vechev, “An abstract domain for certifying neural networks,” Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–30, 2019

[42] S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana, “Formal security analysis of neural networks using symbolic intervals,” in 27th USENIX Security Symposium (USENIX Security 18), 2018, pp. 1599–1614

[43] T. Baluta, Z. L. Chua, K. S. Meel, and P. Saxena, “Scalable quantitative verification for deep neural networks,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021, pp. 312–323

[44] X. Xie, L. Ma, F. Juefei-Xu, M. Xue, H. Chen, Y. Liu, J. Zhao, B. Li, J. Yin, and S. See, “Deephunter: a coverage-guided fuzz testing framework for deep neural networks,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2019, pp. 146–157

[45] T. Y. Chen, S. C. Cheung, and S. M. Yiu, “Metamorphic testing: a new approach for generating next test cases,” arXiv preprint arXiv:2002.12543, 2020

[46] C. Tsigkanos, P. Rani, S. Müller, and T. Kehrer, “Large language models: The next frontier for variable discovery within metamorphic testing?” in 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2023, pp. 678–682

[47] L. Applis, A. Panichella, and A. van Deursen, “Assessing robustness of ml-based program analysis tools using metamorphic program transformations,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 1377–1381