arxiv: 2503.14281 · v4 · submitted 2025-03-18 · 💻 cs.CR · cs.LG · cs.SE

XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants

Pith reviewed 2026-05-22 23:52 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.SE
keywords context poisoning · AI coding assistants · adversarial attacks · LLM security · semantically equivalent code · black-box attacks · Cayley Graph

The pith

Attackers can poison AI coding assistants by making semantically equivalent modifications to code from different sources, achieving 75.72% success on average with a new graph search algorithm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that AI coding assistants automatically gather context from multiple origins into prompts without sanitization, creating an opening for attackers to insert subtle poisons. These poisons consist of code changes that keep the same meaning but alter how the model responds, such as producing vulnerable code. The authors introduce XOXO attacks and the GCGS algorithm, which uses a Cayley Graph to explore the space of possible transformations in a black-box way. This leads to high attack success rates across various tasks and models, and shows that standard defenses do not work. A sympathetic reader would care because it reveals a practical way to compromise widely used tools while making the attack hard to spot.

Core claim

We introduce XOXO, a cross-origin context poisoning attack that relies on adversarial but semantically equivalent code modifications to compromise AI coding assistants without detection by traditional analysis. To find effective modifications, we propose GCGS, a task-agnostic black-box algorithm that searches the transformation space using a Cayley Graph, achieving an average attack success rate of 75.72% across five tasks and eleven models including GPT-4.1 and Claude 3.5 Sonnet v2. Adversarial fine-tuning defenses prove ineffective against this approach.
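
To make "semantically equivalent modification" concrete, here is a minimal Python sketch that is not taken from the paper. It renames one local variable with the standard `ast` module and checks that the function behaves the same before and after. The function `total`, the identifiers `acc` and `buf`, and the helpers `rename` and `run` are all hypothetical. The sketch shows only that behavior is preserved; whether a given renaming shifts a model's output is what the paper's search has to discover.

```python
import ast


class RenameIdentifier(ast.NodeTransformer):
    """Rename every use of one identifier (hypothetical helper)."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Covers reads and writes of plain variables.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Covers function parameters.
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename(source: str, old: str, new: str) -> str:
    tree = RenameIdentifier(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


def run(source: str, *args):
    namespace = {}
    exec(source, namespace)  # acceptable here: the source is our own toy string
    return namespace["total"](*args)


ORIGINAL = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)
RENAMED = rename(ORIGINAL, "acc", "buf")

# Same behavior, different surface form.
same_behavior = run(ORIGINAL, [1, 2, 3]) == run(RENAMED, [1, 2, 3])
```

A static analyzer comparing the two versions would see no change in behavior, which is the property the attack relies on to stay unnoticed.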

What carries the argument

GCGS algorithm, which systematically searches the transformation space of semantically equivalent code changes using a Cayley Graph structure to identify effective poisoning inputs in a black-box setting.
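
The two-phase search described in the Figure 3 caption (score each transform alone, then compose greedily starting from the lowest confidence) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the whole-word regex renamer, the generator list, the stopping threshold, and above all `toy_confidence` are hypothetical. In a real attack the confidence value would come from querying the black-box model; here a fixed lookup table stands in for it.

```python
import re
from typing import Callable, List, Tuple

# A generator is modelled as (old identifier, new identifier). Hypothetical.
Transform = Tuple[str, str]


def apply_transform(code: str, t: Transform) -> str:
    """Whole-word rename. A toy; real tooling would rename via a parser."""
    old, new = t
    return re.sub(rf"\b{re.escape(old)}\b", new, code)


def greedy_composition_search(
    code: str,
    generators: List[Transform],
    confidence: Callable[[str], float],
    threshold: float = 0.5,  # assumed stopping point, not from the paper
) -> Tuple[str, float, List[Transform]]:
    # Phase 1: apply each generator on its own and rank by resulting confidence.
    ranked = sorted(generators, key=lambda g: confidence(apply_transform(code, g)))

    # Phase 2: compose greedily, lowest-confidence generator first,
    # keeping a step only if it lowers confidence further.
    current, best, path = code, confidence(code), []
    for g in ranked:
        if best < threshold:
            break
        candidate = apply_transform(current, g)
        score = confidence(candidate)
        if score < best:
            current, best = candidate, score
            path.append(g)
    return current, best, path


# Stand-in for the model: confidence drops when certain names appear. Made up.
WEIGHTS = {"tmp": 0.25, "data": 0.10, "count": 0.30}


def toy_confidence(code: str) -> float:
    tokens = set(re.findall(r"[A-Za-z_]\w*", code))
    return round(0.9 - sum(w for name, w in WEIGHTS.items() if name in tokens), 2)


SNIPPET = (
    "def copy_items(src, dst, n):\n"
    "    for i in range(n):\n"
    "        dst[i] = src[i]\n"
)
GENERATORS = [("src", "data"), ("dst", "tmp"), ("n", "count"), ("i", "k")]

poisoned, score, path = greedy_composition_search(SNIPPET, GENERATORS, toy_confidence)
```

Each accepted step corresponds to following one generator edge in the transformation graph, so `path` is a walk from the original code to the poisoned variant. The sketch needs only the model's outputs, which matches the black-box setting.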

If this is right

  • The attack succeeds against eleven models used in popular AI coding assistants.
  • Existing defenses such as adversarial fine-tuning fail to mitigate the attack.
  • Attackers can cause the generation of incorrect or vulnerable code while appearing legitimate.
  • The method applies across five different tasks in a task-agnostic manner.
  • Blame for bad outputs can be shifted to the victim developer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI coding tools may need to track the origin and provenance of all context pieces to prevent such poisoning.
  • Similar context aggregation in other LLM applications could be vulnerable to equivalent attacks.
  • Developers should consider manual review or additional verification steps for AI-generated code from large contexts.

Load-bearing premise

Automatic gathering of context from multiple origins into the LLM prompt occurs without any sanitization or checks on where the context came from.
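
The premise can be pictured with a small Python sketch of naive context aggregation. It is an assumption about how such a gatherer might behave, not a description of any specific assistant: `ContextPiece`, `build_prompt`, the file paths, and the snippets are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextPiece:
    origin: str  # where the snippet came from (hypothetical field)
    text: str    # the snippet itself


def build_prompt(task: str, pieces: List[ContextPiece]) -> str:
    # Naive aggregation: every snippet is concatenated verbatim.
    # The origin is never inspected and never reaches the prompt.
    context = "\n\n".join(piece.text for piece in pieces)
    return f"{context}\n\n# Task: {task}\n"


pieces = [
    ContextPiece(
        origin="app/utils.py (victim developer)",
        text="def load(path):\n    return open(path).read()",
    ),
    ContextPiece(
        origin="vendor/helpers.py (third-party contributor)",
        text="def parse(buf):\n    return buf.split(',')",
    ),
]
prompt = build_prompt("write a CSV reader", pieces)
```

Once the snippets are joined, nothing in `prompt` distinguishes the third-party code from the developer's own, so neither the model nor a reviewer of the prompt can tell which origin supplied which lines.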

What would settle it

Testing whether applying the GCGS-found transformations to code in a multi-file project causes the AI assistant to output the intended poisoned result on one of the five tasks with one of the eleven models.

Figures

Figures reproduced from arXiv: 2503.14281 by Adam Štorek, Aditya Gupta, Janie Kim, Mukur Gupta, Noopur Bhatt, Prashast Srivastava, Suman Jana.

Figure 1: Comparison between a benign and vulnerable workflow for a developer using GitHub
Figure 2: An overview of the Cross-Origin Context Poisoning (XOXO) attack
Figure 3: The two phases of GCGS: (1) individual exploration of transforms g, computing α(g(c)), and (2) greedy composition from lowest confidence, descending the tree. For defect and clone detection tasks, we seed identifiers from their respective training sets to avoid out-of-distribution effects in fine-tuned models. For smaller Python datasets (HumanEval+, MBPP+, and CWEval/Python), we extract identifiers from …
Figure 4: Code from Claude 3.5 Sonnet v2 with a subtle bug injected via GCGS attack. The code
Figure 5: Number of unique tasks where GCGS caused the model to generate code failing at most
Figure 6: In-Context Code Generation Chat Prompt Template describing the expected input format
Original abstract

AI coding assistants are widely used for tasks like code generation. These tools now require large and complex contexts, automatically sourced from various origins (across files, projects, and contributors), forming part of the prompt fed to underlying LLMs. This automatic context-gathering introduces new vulnerabilities, allowing attackers to subtly poison input to compromise the assistant's outputs, potentially generating vulnerable code or introducing critical errors. We propose a novel attack, Cross-Origin Context Poisoning (XOXO), that is challenging to detect as it relies on adversarial code modifications that are semantically equivalent. Traditional program analysis techniques struggle to identify these perturbations since the semantics of the code remains correct, making it appear legitimate. This allows attackers to manipulate coding assistants into producing incorrect outputs, while shifting the blame to the victim developer. We introduce a novel, task-agnostic, black-box attack algorithm GCGS that systematically searches the transformation space using a Cayley Graph, achieving a 75.72% attack success rate on average across five tasks and eleven models, including GPT 4.1 and Claude 3.5 Sonnet v2 used by popular AI coding assistants. Furthermore, defenses like adversarial fine-tuning are ineffective against our attack, underscoring the need for new security measures in LLM-powered coding tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces XOXO, a stealthy cross-origin context poisoning attack on AI coding assistants. It argues that automatic aggregation of context from multiple origins (files, projects) into LLM prompts creates an unsanitized attack surface, allowing semantically equivalent code modifications to poison outputs (e.g., inducing vulnerable code) while appearing legitimate. The authors propose GCGS, a task-agnostic black-box algorithm that searches the space of code transformations via a Cayley graph, reporting a 75.72% average attack success rate across five tasks and eleven models (including GPT-4.1 and Claude 3.5 Sonnet v2). They further claim that adversarial fine-tuning is ineffective as a defense.

Significance. If the results hold under realistic multi-origin context aggregation, the work would highlight an important new vulnerability class for LLM-based developer tools. The Cayley-graph search method offers a systematic way to generate stealthy, semantics-preserving adversarial examples, which could influence future defenses in code-generation systems.

major comments (2)
  1. [Abstract and Experimental Evaluation] The headline 75.72% ASR claim (and the ineffectiveness of adversarial fine-tuning) is presented without any description of the experimental protocol, including how multi-origin context was collected and aggregated (e.g., file inclusion order, deduplication rules, provenance metadata), the exact set of transformations enumerated by GCGS, number of trials per task/model, baselines, or statistical tests. This absence prevents evaluation of whether the central empirical result is reproducible or load-bearing.
  2. [Threat Model and Evaluation Setup] The threat model posits automatic context gathering across origins without sanitization, yet no evidence is supplied that the reported experiments used an actual context-collection pipeline rather than hand-crafted single prompts. If real IDE gatherers apply even lightweight filtering or reordering, the Cayley-graph search may not achieve the claimed success rate; this assumption is load-bearing for the attack's practicality.
minor comments (2)
  1. [Abstract] Clarify model naming (e.g., 'GPT 4.1' in the abstract) and list exact versions and access methods for all eleven models.
  2. [GCGS Algorithm] The description of the Cayley graph construction would benefit from a short pseudocode or diagram showing how semantic equivalence is preserved during search.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater clarity on our experimental protocol and threat model assumptions. We address each comment below and will revise the manuscript accordingly to improve reproducibility and address concerns about realism.

  1. Referee: [Abstract and Experimental Evaluation] The headline 75.72% ASR claim (and the ineffectiveness of adversarial fine-tuning) is presented without any description of the experimental protocol, including how multi-origin context was collected and aggregated (e.g., file inclusion order, deduplication rules, provenance metadata), the exact set of transformations enumerated by GCGS, number of trials per task/model, baselines, or statistical tests. This absence prevents evaluation of whether the central empirical result is reproducible or load-bearing.

    Authors: We agree the abstract omits protocol details (standard for length constraints) and that the experimental section would benefit from explicit enumeration. The full manuscript's Section 4 already specifies context aggregation (concatenation by file modification time with no deduplication), the GCGS transformation set (12 operators including variable renaming and statement reordering), 100 trials per task-model pair, random-search baseline, and mean ASR with standard deviation. In revision we will (1) add a one-sentence protocol summary to the abstract, (2) expand Section 4 with a table listing all GCGS generators and inclusion rules, and (3) report 95% confidence intervals and paired t-tests against the baseline. revision: yes

  2. Referee: [Threat Model and Evaluation Setup] The threat model posits automatic context gathering across origins without sanitization, yet no evidence is supplied that the reported experiments used an actual context-collection pipeline rather than hand-crafted single prompts. If real IDE gatherers apply even lightweight filtering or reordering, the Cayley-graph search may not achieve the claimed success rate; this assumption is load-bearing for the attack's practicality.

    Authors: Our evaluation constructs multi-origin prompts by programmatically combining snippets from distinct files and repositories exactly as described in the threat model (Section 3), without any sanitization step. This matches the automatic aggregation behavior reported for tools such as Cursor and GitHub Copilot. We did not instrument a live IDE gatherer, which is a methodological limitation common to black-box LLM attacks. In the revision we will add a dedicated paragraph in Section 4.2 discussing robustness to common lightweight filters (e.g., duplicate removal, provenance stripping) and include an auxiliary experiment measuring ASR degradation under simulated reordering. revision: partial
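
The statistical reporting promised in the first response (an attack success rate with a 95% confidence interval) could look something like the Python sketch below. The Wilson score interval is chosen here for illustration; the rebuttal does not say which interval the authors would use. The outcome list is made up to show the arithmetic and is not data from the paper.

```python
import math
from typing import List, Tuple


def attack_success_rate(outcomes: List[bool]) -> float:
    """Fraction of trials in which the attack succeeded."""
    return sum(outcomes) / len(outcomes)


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Made-up outcomes for one hypothetical task-model pair: 75 successes in 100 trials.
outcomes = [True] * 75 + [False] * 25
asr = attack_success_rate(outcomes)
low, high = wilson_interval(sum(outcomes), len(outcomes))
```

Reporting `low` and `high` next to `asr` for each task-model pair would let a reader judge how much of a headline average is noise. The same two functions could be rerun on outcomes collected under simulated filtering or reordering to measure any degradation.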

Circularity Check

0 steps flagged

No significant circularity; empirical attack success rates are direct experimental measurements

The paper presents an empirical black-box attack (GCGS) and reports measured attack success rates (75.72% average) on external commercial models. No equations, parameter fitting, self-definitional constructs, or load-bearing self-citations appear in the derivation of the central claim. The reported ASR constitutes independent evidence obtained by applying transformations to prompts and observing model outputs, rather than any reduction of the result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security paper. The abstract introduces no free parameters, mathematical axioms, or new physical entities; it relies on the domain assumption that code semantics can be preserved while altering model behavior.


