XOXO: Stealthy Cross-Origin Context Poisoning Attacks against AI Coding Assistants

Adam Štorek; Aditya Gupta; Janie Kim; Mukur Gupta; Noopur Bhatt; Prashast Srivastava; Suman Jana

arXiv: 2503.14281 · v4 · submitted 2025-03-18 · 💻 cs.CR · cs.LG · cs.SE

Pith reviewed 2026-05-22 23:52 UTC · model grok-4.3

keywords context poisoning · AI coding assistants · adversarial attacks · LLM security · semantically equivalent code · black-box attacks · Cayley Graph

The pith

Attackers can poison AI coding assistants by making semantically equivalent modifications to code from different sources, achieving 75.72% success on average with a new graph search algorithm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that AI coding assistants automatically gather context from multiple origins into prompts without sanitization, creating an opening for attackers to insert subtle poisons. These poisons consist of code changes that keep the same meaning but alter how the model responds, such as producing vulnerable code. The authors introduce XOXO attacks and the GCGS algorithm, which uses a Cayley Graph to explore the space of possible transformations in a black-box way. This leads to high attack success rates across various tasks and models, and shows that standard defenses do not work. A sympathetic reader would care because it reveals a practical way to compromise widely used tools while making the attack hard to spot.

Core claim

We introduce XOXO, a cross-origin context poisoning attack that relies on adversarial but semantically equivalent code modifications to compromise AI coding assistants without detection by traditional analysis. To find effective modifications, we propose GCGS, a task-agnostic black-box algorithm that searches the transformation space using a Cayley Graph, achieving an average attack success rate of 75.72% across five tasks and eleven models including GPT-4.1 and Claude 3.5 Sonnet v2. Adversarial fine-tuning defenses prove ineffective against this approach.

What carries the argument

GCGS algorithm, which systematically searches the transformation space of semantically equivalent code changes using a Cayley Graph structure to identify effective poisoning inputs in a black-box setting.
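
To make the Cayley Graph framing concrete, here is a minimal sketch under a toy assumption that is not taken from the paper: three identifier slots, with two generators that each swap a pair of slots. Each node is a composed renaming, each edge applies one generator, and a breadth-first walk reports how many atomic transformations are needed to reach each node. The names `GENERATORS`, `compose`, and `cayley_bfs` are hypothetical and chosen only for illustration.

```python
from collections import deque
from typing import Dict, Tuple

Perm = Tuple[int, ...]

# Hypothetical generators: each one swaps two of three identifier slots.
GENERATORS: Tuple[Perm, ...] = ((1, 0, 2), (0, 2, 1))
IDENTITY: Perm = (0, 1, 2)


def compose(current: Perm, generator: Perm) -> Perm:
    """Apply `generator` after `current` (right multiplication)."""
    return tuple(generator[i] for i in current)


def cayley_bfs(start: Perm = IDENTITY) -> Dict[Perm, int]:
    """Map every reachable renaming to its edge distance from `start`."""
    distance: Dict[Perm, int] = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for generator in GENERATORS:
            neighbour = compose(node, generator)
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


if __name__ == "__main__":
    for perm, dist in sorted(cayley_bfs().items(), key=lambda kv: kv[1]):
        print(perm, dist)
```

In this toy group the walk reaches all six renamings of three names. A realistic transformation space is far larger, which is why an exhaustive walk like this one would be replaced by a guided search.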

If this is right

The attack is effective across the eleven evaluated models, which include GPT-4.1 and Claude 3.5 Sonnet v2 as used by popular AI coding assistants.
Existing defenses such as adversarial fine-tuning fail to mitigate the attack.
Attackers can cause the generation of incorrect or vulnerable code while appearing legitimate.
The method applies across five different tasks in a task-agnostic manner.
Blame for bad outputs can be shifted to the victim developer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

AI coding tools may need to track the origin and provenance of all context pieces to prevent such poisoning.
Similar context aggregation in other LLM applications could be vulnerable to equivalent attacks.
Developers should consider manual review or additional verification steps for AI-generated code from large contexts.

Load-bearing premise

Automatic gathering of context from multiple origins into the LLM prompt occurs without any sanitization or checks on where the context came from.
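
The premise can be illustrated with a minimal sketch of a context gatherer. Everything here is an assumption made for illustration: the `ContextSnippet` type, the origin labels, and the prompt layout are hypothetical and do not describe any real assistant. `build_prompt` concatenates whatever it is given, while `filter_by_origin` shows the kind of provenance check the premise says is absent.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ContextSnippet:
    origin: str  # hypothetical label, e.g. "workspace" or "third_party"
    path: str
    text: str


def build_prompt(snippets: Iterable[ContextSnippet], task: str) -> str:
    """Concatenate every snippet into the prompt with no origin check."""
    parts = [f"# file: {s.path}\n{s.text}" for s in snippets]
    return "\n\n".join(parts + [f"# task\n{task}"])


def filter_by_origin(
    snippets: Iterable[ContextSnippet], trusted: Set[str]
) -> List[ContextSnippet]:
    """Keep only snippets whose origin label is in the trusted set."""
    return [s for s in snippets if s.origin in trusted]


if __name__ == "__main__":
    gathered = [
        ContextSnippet("workspace", "app.py", "def total(xs): return sum(xs)"),
        ContextSnippet("third_party", "util.py", "def helper(buf): return buf"),
    ]
    print(build_prompt(gathered, "Write a function that averages a list."))
```

Because the poisoned code is semantically equivalent to the original, a check has to key on where a snippet came from rather than on what it contains. That is why the sketch filters on `origin` and not on `text`.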

What would settle it

Testing whether applying the GCGS-found transformations to code in a multi-file project causes the AI assistant to output the intended poisoned result on one of the five tasks with one of the eleven models.

Figures

Figures reproduced from arXiv: 2503.14281 by Adam Štorek, Aditya Gupta, Janie Kim, Mukur Gupta, Noopur Bhatt, Prashast Srivastava, Suman Jana.

**Figure 2.** An overview of the Cross-Origin Context Poisoning (XOXO) attack.

**Figure 3.** The two phases of GCGS: (1) individual exploration of transforms g, computing α(g(c)), and (2) greedy composition from lowest confidence, descending the tree. For defect and clone detection tasks, we seed identifiers from their respective training sets to avoid out-of-distribution effects in fine-tuned models. For smaller Python datasets (HumanEval+, MBPP+, and CWEval/Python), we extract identifiers from …

**Figure 4.** Code from Claude 3.5 Sonnet v2 with a subtle bug injected via GCGS attack. The code …

**Figure 5.** Number of unique tasks where GCGS caused the model to generate code failing at most …

**Figure 6.** In-Context Code Generation Chat Prompt Template describing the expected input format …
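
The two phases named in the Figure 3 caption can be sketched as follows. This is a simplified illustration, not the authors' implementation: `toy_alpha` is a made-up stand-in for the model's confidence α, the success threshold and the "accept only if confidence drops" rule are assumptions, and the three renames are hypothetical transforms g.

```python
import re
from typing import Callable, Sequence, Tuple

Transform = Callable[[str], str]


def rename(old: str, new: str) -> Transform:
    """Whole-word rename; preserves semantics only if `new` is unused."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return lambda code: pattern.sub(new, code)


def toy_alpha(code: str) -> float:
    """Made-up stand-in for the model's confidence in the correct output."""
    score = 1.0
    if "tmp" in code:
        score -= 0.3
    if "buf" in code:
        score -= 0.4
    return score


def gcgs_sketch(
    code: str,
    transforms: Sequence[Transform],
    alpha: Callable[[str], float],
    threshold: float = 0.5,  # assumed success criterion
) -> Tuple[str, float, bool]:
    # Phase 1: score each transform applied on its own, lowest first.
    scored = sorted((alpha(g(code)), i) for i, g in enumerate(transforms))
    for conf, i in scored:
        if conf < threshold:
            return transforms[i](code), conf, True
    # Phase 2: compose greedily, starting from the lowest-confidence
    # transform and keeping a step only if it lowers confidence further.
    current, best = code, alpha(code)
    for _, i in scored:
        candidate = transforms[i](current)
        conf = alpha(candidate)
        if conf < best:
            current, best = candidate, conf
            if best < threshold:
                return current, best, True
    return current, best, False


if __name__ == "__main__":
    source = "def add(total, value):\n    return total + value\n"
    candidates = [rename("total", "tmp"), rename("value", "buf"), rename("add", "plus")]
    adversarial, confidence, success = gcgs_sketch(source, candidates, toy_alpha)
    print(adversarial, confidence, success)
```

In this toy run no single rename crosses the threshold, so the attack succeeds only once two renames are composed. That is the situation the second phase is meant to handle.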

Original abstract

AI coding assistants are widely used for tasks like code generation. These tools now require large and complex contexts, automatically sourced from various origins (across files, projects, and contributors), forming part of the prompt fed to underlying LLMs. This automatic context-gathering introduces new vulnerabilities, allowing attackers to subtly poison input to compromise the assistant's outputs, potentially generating vulnerable code or introducing critical errors. We propose a novel attack, Cross-Origin Context Poisoning (XOXO), that is challenging to detect as it relies on adversarial code modifications that are semantically equivalent. Traditional program analysis techniques struggle to identify these perturbations since the semantics of the code remains correct, making it appear legitimate. This allows attackers to manipulate coding assistants into producing incorrect outputs, while shifting the blame to the victim developer. We introduce a novel, task-agnostic, black-box attack algorithm GCGS that systematically searches the transformation space using a Cayley Graph, achieving a 75.72% attack success rate on average across five tasks and eleven models, including GPT 4.1 and Claude 3.5 Sonnet v2 used by popular AI coding assistants. Furthermore, defenses like adversarial fine-tuning are ineffective against our attack, underscoring the need for new security measures in LLM-powered coding tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper identifies a real attack surface in auto-gathered context for coding assistants and offers a graph-search method to find stealthy changes, but the 75.72% average success claim has no visible experimental grounding in the abstract.

The core observation is worth attention: coding assistants pull code from multiple files and origins into the prompt without provenance checks, so an attacker can insert semantically identical but maliciously altered snippets that survive normal analysis. The authors call this XOXO and supply GCGS, a black-box search over transformations framed as a Cayley graph walk. That framing and the target (cross-origin context in IDE tools) are the main new pieces; prior poisoning work exists but has not focused on this collection mechanism in developer workflows.

Referee Report

2 major / 2 minor

Summary. The paper introduces XOXO, a stealthy cross-origin context poisoning attack on AI coding assistants. It argues that automatic aggregation of context from multiple origins (files, projects) into LLM prompts creates an unsanitized attack surface, allowing semantically equivalent code modifications to poison outputs (e.g., inducing vulnerable code) while appearing legitimate. The authors propose GCGS, a task-agnostic black-box algorithm that searches the space of code transformations via a Cayley graph, reporting a 75.72% average attack success rate across five tasks and eleven models (including GPT-4.1 and Claude 3.5 Sonnet v2). They further claim that adversarial fine-tuning is ineffective as a defense.

Significance. If the results hold under realistic multi-origin context aggregation, the work would highlight an important new vulnerability class for LLM-based developer tools. The Cayley-graph search method offers a systematic way to generate stealthy, semantics-preserving adversarial examples, which could influence future defenses in code-generation systems.

major comments (2)

[Abstract and Experimental Evaluation] The headline 75.72% attack success rate (ASR) claim, and the claim that adversarial fine-tuning is ineffective, are presented without any description of the experimental protocol. Missing details include how multi-origin context was collected and aggregated (e.g., file inclusion order, deduplication rules, provenance metadata), the exact set of transformations enumerated by GCGS, the number of trials per task/model, baselines, and statistical tests. This absence prevents evaluation of whether the central empirical result is reproducible or load-bearing.
[Threat Model and Evaluation Setup] The threat model posits automatic context gathering across origins without sanitization, yet no evidence is supplied that the reported experiments used an actual context-collection pipeline rather than hand-crafted single prompts. If real IDE gatherers apply even lightweight filtering or reordering, the Cayley-graph search may not achieve the claimed success rate; this assumption is load-bearing for the attack's practicality.

minor comments (2)

[Abstract] Clarify model naming (e.g., 'GPT 4.1' in the abstract) and list exact versions and access methods for all eleven models.
[GCGS Algorithm] The description of the Cayley graph construction would benefit from a short pseudocode or diagram showing how semantic equivalence is preserved during search.
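
As an illustration of the kind of check the last comment asks for, here is a minimal sketch of one semantics-preserving transform, a local identifier rename, paired with a behavioral spot check. It is an assumption-laden example rather than the paper's method: it uses Python's standard `ast` module (`ast.unparse` needs Python 3.9+), handles only simple functions, and treats matching outputs on sample inputs as evidence of equivalence, not proof. Renaming a parameter would also break keyword-argument call sites, which this sketch does not address.

```python
import ast
from typing import Iterable, Tuple


class _Renamer(ast.NodeTransformer):
    """Rename every use of one identifier, including parameter names."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        self.generic_visit(node)  # also rewrite names inside annotations
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_identifier(source: str, old: str, new: str) -> str:
    """Return `source` with `old` renamed to `new`; refuse on a name clash."""
    tree = ast.parse(source)
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            used.add(node.id)
        elif isinstance(node, ast.arg):
            used.add(node.arg)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            used.add(node.name)
    if new in used:
        raise ValueError(f"{new!r} is already in use; rename would change semantics")
    return ast.unparse(_Renamer(old, new).visit(tree))


def same_outputs(
    src_a: str, src_b: str, func_name: str, inputs: Iterable[Tuple]
) -> bool:
    """Run both versions on sample inputs and compare results."""
    ns_a: dict = {}
    ns_b: dict = {}
    exec(src_a, ns_a)
    exec(src_b, ns_b)
    return all(ns_a[func_name](*args) == ns_b[func_name](*args) for args in inputs)


if __name__ == "__main__":
    original = "def scale(value, factor):\n    result = value * factor\n    return result\n"
    renamed = rename_identifier(original, "result", "buf")
    print(renamed)
    print(same_outputs(original, renamed, "scale", [(1, 2), (3, 4)]))
```

The clash check matters because renaming onto a name that is already bound silently changes behavior; in that case the transform would no longer be a valid move in the search.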

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater clarity on our experimental protocol and threat model assumptions. We address each comment below and will revise the manuscript accordingly to improve reproducibility and address concerns about realism.

Referee: [Abstract and Experimental Evaluation] The headline 75.72% ASR claim (and the ineffectiveness of adversarial fine-tuning) is presented without any description of the experimental protocol, including how multi-origin context was collected and aggregated (e.g., file inclusion order, deduplication rules, provenance metadata), the exact set of transformations enumerated by GCGS, number of trials per task/model, baselines, or statistical tests. This absence prevents evaluation of whether the central empirical result is reproducible or load-bearing.

Authors: We agree the abstract omits protocol details (standard for length constraints) and that the experimental section would benefit from explicit enumeration. The full manuscript's Section 4 already specifies context aggregation (concatenation by file modification time with no deduplication), the GCGS transformation set (12 operators including variable renaming and statement reordering), 100 trials per task-model pair, random-search baseline, and mean ASR with standard deviation. In revision we will (1) add a one-sentence protocol summary to the abstract, (2) expand Section 4 with a table listing all GCGS generators and inclusion rules, and (3) report 95% confidence intervals and paired t-tests against the baseline. Revision: yes.
Referee: [Threat Model and Evaluation Setup] The threat model posits automatic context gathering across origins without sanitization, yet no evidence is supplied that the reported experiments used an actual context-collection pipeline rather than hand-crafted single prompts. If real IDE gatherers apply even lightweight filtering or reordering, the Cayley-graph search may not achieve the claimed success rate; this assumption is load-bearing for the attack's practicality.

Authors: Our evaluation constructs multi-origin prompts by programmatically combining snippets from distinct files and repositories exactly as described in the threat model (Section 3), without any sanitization step. This matches the automatic aggregation behavior reported for tools such as Cursor and GitHub Copilot. We did not instrument a live IDE gatherer, which is a methodological limitation common to black-box LLM attacks. In the revision we will add a dedicated paragraph in Section 4.2 discussing robustness to common lightweight filters (e.g., duplicate removal, provenance stripping) and include an auxiliary experiment measuring ASR degradation under simulated reordering. Revision: partial.
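
The statistics the simulated rebuttal promises (mean ASR with standard deviation, 95% confidence intervals, and a paired t-test against a baseline) could be computed along these lines. This is a hedged sketch using only Python's standard library: the function names are hypothetical, the interval uses a normal approximation (z = 1.96) that is rough for small samples, and all numbers in the usage example are invented placeholders rather than results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of trials in which the attack succeeded."""
    return sum(outcomes) / len(outcomes)


def mean_with_ci(rates: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Mean and normal-approximation confidence interval (default 95%)."""
    mean = statistics.mean(rates)
    half_width = z * statistics.stdev(rates) / math.sqrt(len(rates))
    return mean, mean - half_width, mean + half_width


def paired_t_statistic(attack: Sequence[float], baseline: Sequence[float]) -> float:
    """t statistic for paired differences; compare against n - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(attack, baseline)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Placeholder per-pair ASR values, invented purely for illustration.
    attack_rates = [0.80, 0.70, 0.90]
    baseline_rates = [0.50, 0.50, 0.50]
    print(mean_with_ci(attack_rates))
    print(paired_t_statistic(attack_rates, baseline_rates))
```

Pairing by task-model combination compares each attack run with the baseline run on the same target, so differences between models do not inflate the variance.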

Circularity Check

0 steps flagged

No significant circularity; empirical attack success rates are direct experimental measurements

The paper presents an empirical black-box attack (GCGS) and reports measured attack success rates (75.72% average) on external commercial models. No equations, parameter fitting, self-definitional constructs, or load-bearing self-citations appear in the derivation of the central claim. The reported ASR constitutes independent evidence obtained by applying transformations to prompts and observing model outputs, rather than any reduction of the result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security paper. The abstract introduces no free parameters, mathematical axioms, or new physical entities; it relies on the domain assumption that code semantics can be preserved while altering model behavior.

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 11 internal anchors

[1]

https://www.codeium.com/,

Codeium: Free ai code completion & chat. https://www.codeium.com/, . Accessed: 2024- 11-08

work page 2024
[2]

https://sourcegraph.com/cody,

Cody by sourcegraph. https://sourcegraph.com/cody, . Accessed: 2024-11-08

work page 2024
[3]

https://continue.dev/

Continue: Open-source code copilot. https://continue.dev/. Accessed: 2024-11-08

work page 2024
[4]

https://github.com/features/copilot

Github copilot. https://github.com/features/copilot. Accessed: 2024-11-08

work page 2024
[5]

https://www.cursor.so/

Cursor. https://www.cursor.so/. Accessed: 2024-11-08

work page 2024
[6]

URL https://cwe.mitre.org/

CWE - Common Weakness Enumeration. URL https://cwe.mitre.org/

work page
[7]

URL https://github.com/ meta-llama/PurpleLlama/tree/main/CodeShield

PurpleLlama/CodeShield at main · meta-llama/PurpleLlama, . URL https://github.com/ meta-llama/PurpleLlama/tree/main/CodeShield

work page
[8]

URL https://github.com/meta-llama/PurpleLlama/tree/main/CodeShield/ insecure_code_detector

PurpleLlama/CodeShield/insecure code detector at main · meta-llama/PurpleLlama, . URL https://github.com/meta-llama/PurpleLlama/tree/main/CodeShield/ insecure_code_detector

work page
[9]

https://www.tabnine.com/

Tabnine: Ai code completion for all languages. https://www.tabnine.com/. Accessed: 2024-11-08

work page 2024
[10]

URL https://tree-sitter.github.io/tree-sitter/

Tree-sitter. URL https://tree-sitter.github.io/tree-sitter/

work page
[11]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

The claude 3 model family: Opus, sonnet, haiku, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www. anthropic.com/news/claude-3-5-sonnet

work page 2024
[13]

Adversarial robustness for code, 2020

Pavol Bielik and Martin Vechev. Adversarial robustness for code, 2020. URL https:// arxiv.org/abs/2002.04694

work page arXiv 2020
[14]

Barr, Santanu Kumar Dash, Prem Devanbu, and Emily Mor- gan

Casey Casalnuovo, Earl T. Barr, Santanu Kumar Dash, Prem Devanbu, and Emily Mor- gan. A theory of dual channel constraints. In Proceedings of the ACM/IEEE 42nd In- ternational Conference on Software Engineering: New Ideas and Emerging Results , page 25–28. Association for Computing Machinery, 2020. doi: 10.1145/3377816.3381720. URL https://doi.org/10.1145...

work page doi:10.1145/3377816.3381720 2020
[15]

mitmproxy: A free and open source interactive HTTPS proxy, 2010–

Aldo Cortesi, Maximilian Hils, Thomas Kriechbaumer, and contributors. mitmproxy: A free and open source interactive HTTPS proxy, 2010–. URLhttps://mitmproxy.org/. [Version 11.0]

work page 2010
[16]

How gradient created an open llm with a million-token con- text window

Ben Dickson. How gradient created an open llm with a million-token con- text window. VentureBeat, June 2024. URL https://venturebeat.com/ai/ how-gradient-created-an-open-llm-with-a-million-token-context-window/

work page 2024
[17]

Vulnerability detection with code language models: How far are we?

Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alo- mair, David Wagner, Baishakhi Ray, and Yizheng Chen. Vulnerability detection with code language models: How far are we?, 2024. URL https://arxiv.org/abs/2403.18624

work page arXiv 2024
[18]

Django: The Web framework for perfectionists with deadlines,

Django Software Foundation. Django: The Web framework for perfectionists with deadlines,

work page
[19]

Accessed: 2024-10-31

URL https://www.djangoproject.com/. Accessed: 2024-10-31. 10

work page 2024
[20]

Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview, October 2024

Thomas Dohmke. Bringing developer choice to Copilot with Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s o1-preview, October 2024. URLhttps://github. blog/news-insights/product-news/bringing-developer-choice-to-copilot/

work page 2024
[21]

An extensive study on adversarial attack against pre-trained models of code

Xiaohu Du, Ming Wen, Zichao Wei, Shangwen Wang, and Hai Jin. An extensive study on adversarial attack against pre-trained models of code. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, page 489–501, New York, NY , USA, 2023. Association for Computing Ma...

work page doi:10.1145/3611643.3616356 2023
[22]

The llama 3 herd of models,

Abhimanyu Dubey, Abhinav Jauhri, and Abhinav Pandey et al. The llama 3 herd of models,

work page
[23]

URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv
[24]

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002
[25]

Github copilot now has a better ai model and new capabilities

GitHub Blog. Github copilot now has a better ai model and new capabilities. https://github.blog/ai-and-ml/github-copilot/ github-copilot-now-has-a-better-ai-model-and-new-capabilities/ , 2023. Accessed: 2024-11-11

work page 2023
[26]

Github copilot

GitHub, Inc. Github copilot. https://code.visualstudio.com/docs/copilot/ overview, 2024. Accessed: 2024-10-17

work page 2024
[27]

Alex Gu, Wen-Ding Li, Naman Jain, Theo Olausson, Celine Lee, Koushik Sen, and Armando Solar-Lezama. The counterfeit conundrum: Can code language models grasp the nuances of their incorrect generations? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, ed- itors, Findings of the Association for Computational Linguistics ACL 2024 , pages 74–117, Bangkok, Th...

work page doi:10.18653/v1/2024.findings-acl.7 2024
[28]

GraphCodeBERT: Pre-training Code Representations with Data Flow

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al. Graphcodebert: Pre-training code representa- tions with data flow. arXiv preprint arXiv:2009.08366, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[29]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y . Wu, Y .K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024. URL https://arxiv.org/abs/2401.14196

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Codescm: Causal analysis for multi-modal code generation, 2025

Mukur Gupta, Noopur Bhatt, and Suman Jana. Codescm: Causal analysis for multi-modal code generation, 2025. URL https://arxiv.org/abs/2502.05150

work page arXiv 2025
[31]

Jaiswal, and Radha Poovendran

Hossein Hosseini, Baicen Xiao, Mayoore S. Jaiswal, and Radha Poovendran. On the limita- tion of convolutional neural networks in recognizing negative images. 2017 16th IEEE Inter- national Conference on Machine Learning and Applications (ICMLA) , pages 352–358, 2017. URL https://api.semanticscholar.org/CorpusID:24753302

work page 2017
[32]

Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Ji- ajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2. 5-coder technical report. arXiv preprint arXiv:2409.12186, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[34]

Sok: History is a vast early warning system: Auditing the provenance of system intrusions

Muhammad Adil Inam, Yinfang Chen, Akul Goyal, Jason Liu, Jaron Mink, Noor Michael, Sneha Gaur, Adam Bates, and Wajih Ul Hassan. Sok: History is a vast early warning system: Auditing the provenance of system intrusions. In 2023 IEEE Symposium on Security and Privacy (SP), 2023. 11

work page 2023
[35]

Practical attacks against black-box code completion engines

Slobodan Jenko, Jingxuan He, Niels M ¨undler, Mark Vero, and Martin Vechev. Practical attacks against black-box code completion engines. arXiv preprint arXiv:2408.02509, 2024

work page arXiv 2024
[36]

Murphy-Hill, and Robert W

Brittany Johnson, Yoonki Song, Emerson Murphy-Hill, and Robert Bowdidge. Why don’t soft- ware developers use static analysis tools to find bugs? In 2013 35th International Conference on Software Engineering (ICSE), pages 672–681, 2013. doi: 10.1109/ICSE.2013.6606613

work page doi:10.1109/icse.2013.6606613 2013
[37]

Hong Jin Kang, Khai Loong Aw, and David Lo. Detecting false alarms from automatic static analysis tools: how far are we? In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, page 698–709, New York, NY , USA, 2022. Association for Computing Machinery. ISBN 9781450392211. doi: 10.1145/3510003.3510214. URL https: //doi.org/1...

work page doi:10.1145/3510003.3510214 2022
[38]

Some problems on cayley graphs

Elena Konstantinova. Some problems on cayley graphs. Linear Algebra and its applications, 429(11-12):2754–2769, 2008

work page 2008
[39]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Sym- posium on Operating Systems Principles, 2023

work page 2023
[40]

IRIS: LLM-assisted static analysis for detecting security vulnerabilities

Ziyang Li, Saikat Dutta, and Mayur Naik. IRIS: LLM-assisted static analysis for detecting security vulnerabilities. In The Thirteenth International Conference on Learning Representa- tions, 2025. URL https://openreview.net/forum?id=9LdJDU7E91

work page 2025
[41]

Liu and S

D. Liu and S. Zhang. ALANCA: Active learning guided adversarial attacks for code com- prehension on diverse pre-trained and large language models. In 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) , pages 602–613, Rovaniemi, Finland, 2024. doi: 10.1109/SANER60148.2024.00067

work page doi:10.1109/saner60148.2024.00067 2024
[42]

Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https: //openreview.net/forum?id=1qvx610Cu7

work page 2023
[43]

Repoqa: Evaluating long context code understanding

Jiawei Liu, Jia Le Tian, Vijay Daita, Yuxiang Wei, Yifeng Ding, Yuhan Katherine Wang, Jun Yang, and Lingming Zhang. Repoqa: Evaluating long context code understanding. arXiv preprint arXiv:2406.06025, 2024

work page arXiv 2024
[44]

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning bench- mark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[45]

DIP: Dead code insertion based black- box attack for programming language model

CheolWon Na, YunSeok Choi, and Jee-Hyong Lee. DIP: Dead code insertion based black- box attack for programming language model. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Com- putational Linguistics (Volume 1: Long Papers) , pages 7777–7791, Toronto, Canada, July

work page
[46]

doi: 10.18653/v1/2023.acl-long.430

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.430. URL https://aclanthology.org/2023.acl-long.430

work page doi:10.18653/v1/2023.acl-long.430 2023
[47]

Introducing gpt-4.1 in the api, 2025

OpenAI. Introducing gpt-4.1 in the api, 2025. URL https://openai.com/index/gpt-4-1

work page 2025
[48]

2024 developer survey: Ai and software development

Stack Overflow. 2024 developer survey: Ai and software development. https://survey. stackoverflow.co/2024/ai/, 2024. Accessed: 2024-10-07

work page 2024
[49]

Cweval: Outcome- driven evaluation on functionality and security of llm code generation, 2025

Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, and Baishakhi Ray. Cweval: Outcome- driven evaluation on functionality and security of llm code generation, 2025. URL https: //arxiv.org/abs/2501.08200

work page arXiv 2025
[50]

Do users write more insecure code with ai assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 2785–2799, 2023

Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. Do users write more insecure code with ai assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 2785–2799, 2023. 12

work page 2023
[51]

Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, and Mo- hammad Amin Alipour

Md Rafiqul Islam Rabin, Nghi D.Q. Bui, Ke Wang, Yijun Yu, Lingxiao Jiang, and Mo- hammad Amin Alipour. On the generalizability of neural program models with respect to semantic-preserving program transformations. Information and Software Technology, 2021. doi: https://doi.org/10.1016/j.infsof.2021.106552

work page doi:10.1016/j.infsof.2021.106552 2021
[52]

Semantic robustness of models of source code

Goutham Ramakrishnan, Jordan Henkel, Thomas Reps, and Somesh Jha. Semantic robustness of models of source code. arXiv preprint arXiv:2002.03043, 2020

work page arXiv 2002
[53]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean- baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[54]

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. Codebleu: a method for automatic evaluation of code synthesis, 2020. URL https://arxiv.org/abs/2009.10297

work page internal anchor Pith review Pith/arXiv arXiv 2020
[55]

Collaborative software development environment

Replit. Collaborative software development environment. https://replit.com/, 2024. Accessed: 2024-10-07

work page 2024
[56]

Why larger llm context windows are all the rage

IBM Research. Why larger llm context windows are all the rage. https://research.ibm. com/blog/larger-context-window, 2023. Accessed: 2024-10-18

work page 2023
[57]

Anatomy of a coding assistant, 2023

Quinn Slack. Anatomy of a coding assistant, 2023. URL https://sourcegraph.com/ blog/anatomy-of-a-coding-assistant . Accessed: 2024-10-21

work page 2023
[58]

Generating adversarial computer programs using optimized obfusca- tions, 2021

Shashank Srikant, Sijia Liu, Tamara Mitrovska, Shiyu Chang, Quanfu Fan, Gaoyuan Zhang, and Una-May O’Reilly. Generating adversarial computer programs using optimized obfusca- tions, 2021. URL https://arxiv.org/abs/2103.11882

work page arXiv 2021
[59]

Bigcloneeval: A clone detection tool evaluation frame- work with bigclonebench

Jeffrey Svajlenko and Chanchal K Roy. Bigcloneeval: A clone detection tool evaluation frame- work with bigclonebench. In 2016 IEEE international conference on software maintenance and evolution (ICSME), pages 596–600. IEEE, 2016

work page 2016
[60]

Codestral, May 2024

Mistral AI team. Codestral, May 2024. URL https://mistral.ai/news/codestral. publisher: Mistral AI

work page 2024
[61]

In: Advances in Neural Information Processing Systems, Curran Associates, Inc., vol 34, pp 27,865–27,876,https://proceedings

Zhao Tian, Junjie Chen, and Zhi Jin. Code difference guided adversarial example genera- tion for deep code models. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 850–862, 2023. doi: 10.1109/ASE56229.2023.00149

work page doi:10.1109/ase56229.2023.00149 2023
[62]

Llms cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks, 2024

Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. Llms cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks, 2024. URL https://arxiv.org/ abs/2312.12575

work page arXiv 2024
[63]

ReCode: Robustness evaluation of code generation models

Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, and Bing Xiang. ReCode: Robustness evaluation of code generation models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Ann...

[64] Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 261–271. IEEE, 2020

[65] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, and Steven C. H. Hoi. CodeT5+: Open code large language models for code understanding and generation. arXiv preprint, 2023

[66] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-...

[67] Fangzhou Wu, Xiaogeng Liu, and Chaowei Xiao. DeceptPrompt: Exploiting LLM-driven code generation via adversarial natural language instructions, 2023. URL https://arxiv.org/abs/2312.04730

[68] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ... Qwen2 Technical Report

[69] Zhou Yang, Jieke Shi, Junda He, and David Lo. Natural attack for pre-trained models of code. In Proceedings of the 44th International Conference on Software Engineering, ICSE ’22, pages 1482–1493, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392211. doi: 10.1145/3510003.3510146. URL https://doi.org/10.1145/3510003.3510146

[70] Noam Yefet, Uri Alon, and Eran Yahav. Adversarial examples for models of code. Proc. ACM Program. Lang., 4(OOPSLA), November 2020. doi: 10.1145/3428230. URL https://doi.org/10.1145/3428230

[71] Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang. An extensive study on pre-trained models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 39–51, 2022

[72] Huangzhao Zhang, Zhuo Li, Ge Li, Lei Ma, Yang Liu, and Zhi Jin. Generating adversarial examples for holding robustness of source code processing models. Proceedings of the AAAI Conference on Artificial Intelligence, 34(01):1169–1176, Apr. 2020. doi: 10.1609/aaai.v34i01. URL https://ojs.aaai.org/index.php/AAAI/article/view/5469

[74] Huangzhao Zhang, Zhiyi Fu, Ge Li, Lei Ma, Zhehao Zhao, Hua’an Yang, Yizhe Sun, Yang Liu, and Zhi Jin. Towards robustness of deep program processing models—detection, estimation, and enhancement. ACM Trans. Softw. Eng. Methodol., 31(3), April 2022. ISSN 1049-331X. doi: 10.1145/3511887. URL https://doi.org/10.1145/3511887

[75] Jie Zhang, Wei Ma, Qiang Hu, Shangqing Liu, Xiaofei Xie, Yves Le Traon, and Yang Liu. A black-box attack on code models via representation nearest neighbor search. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9706–9716, Singapore, December 2023. Association for Com...

[76] Shuyin Zhao. GitHub Copilot now has a better AI model and new capabilities, February 2023. URL https://github.blog/ai-and-ml/github-copilot/github-copilot-now-has-a-better-ai-model-and-new-capabilities/

[77] Shasha Zhou, Mingyu Huang, Yanan Sun, and Ke Li. Evolutionary multi-objective optimization for contextual adversarial example generation. Proc. ACM Softw. Eng., 1(FSE), July 2024. doi: 10.1145/3660808. URL https://doi.org/10.1145/3660808

[78] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. Advances in Neural Information Processing Systems, 32, 2019.

A Implementation

Transformations. Although the Cayley Graph structure accommodates any semantics-preserving tr...
