Automatically Attacking Software Reverse Engineering AI Agents

Brian Crawford; Justin Phillips; Patrick McClure

arxiv: 2605.30667 · v1 · pith:6MX6IIXLnew · submitted 2026-05-28 · 💻 cs.CR · cs.AI

Automatically Attacking Software Reverse Engineering AI Agents

Brian Crawford , Justin Phillips , Patrick McClure This is my paper

Pith reviewed 2026-06-29 06:12 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords prompt injectionLLM agentsreverse engineeringadversarial attackdecompilationmalware analysisgenetic algorithmbinary obfuscation

0 comments

The pith

Prompt injections can be hidden in executable binaries to mislead LLM reverse engineering agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attackers can use a genetic algorithm to insert prompt injections into source code as string variable assignments. These assignments do not change the program's behavior but survive compilation and decompilation to deliver instructions to LLM agents analyzing the binary. This reveals a vulnerability in automated tools that combine decompilers with LLMs for malware analysis. If the method works, it allows bypassing of LLM-based detection systems by corrupting their output. The approach adapts an existing adversarial technique to the domain of binary reverse engineering.

Core claim

By modifying the AutoDAN adversarial attack with a genetic algorithm, the authors generate string assignments that embed surreptitious instructions. When the binary is decompiled, the LLM receives these strings as part of the code and follows the hidden prompts, leading to misinterpretation of the executable's functionality without altering its actual behavior.

What carries the argument

Genetic algorithm search for prompt injections inserted as extraneous string variable assignments that carry instructions to the LLM without affecting executable functionality.

If this is right

LLM-powered disassembly and decompilation systems can be deceived into producing incorrect analytical output.
Automated detection systems relying on LLM analysis pipelines can be bypassed by attackers.
Insights can be gained on the security implications of integrating LLMs into cybersecurity toolchains.
More robust agentic code analysis systems are needed to resist such injections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Sanitizing or ignoring string literals in decompiled output could mitigate the attack.
The technique might extend to other LLM agents that process code or text from untrusted sources.
Empirical testing on various LLMs and decompilers would determine the attack's success rate across different models.

Load-bearing premise

LLM agents will treat the content of string variable assignments in decompiled code as actionable instructions instead of filtering or disregarding them.

What would settle it

Compile a program with the generated string assignments, decompile it, feed the output to an LLM agent, and observe if the agent follows the injected instructions or analyzes the code correctly.

read the original abstract

Software tools for reverse engineering executable binary files, such as Ghidra, enable malware analysts to safely conduct robust static analysis without having access to original source code. Coupled with the analytic power of large language models (LLM), agentic systems enabled with tools, such as GhidraMCP, can allow analysts to automate a previously human driven process. Although this automation can increase the productivity of a single malware analyst, it also introduces a new area of vulnerability for malware obfuscation. This paper presents an adversarial technique using genetic algorithm-based prompt generation, a modification of an adversarial attack known as AutoDAN, to demonstrate the ability to deceive LLM-powered disassembly and decompilation systems into misinterpreting binary executables, effectively corrupting their analytical output. This proof-of-concept methodology exploits inherent vulnerabilities in how LLMs process and interpret decompiled machine code via prompt injection by using extraneous string variable assignments to pass surreptitious instructions to the LLM while not impacting the functionality of the executable file. We demonstrate this capability through several concise examples. This approach could enable attackers to bypass automated detection systems that rely on LLM-driven analysis pipelines. By studying and understanding this attack, insights can be gained regarding the security implication of integrating LLMs into cybersecurity toolchains and building more robust agentic code analysis systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This applies AutoDAN-style prompt injection via surviving string literals to LLM reverse-engineering agents as a proof-of-concept, but the evidence is only a handful of examples with no rates or controls.

read the letter

The paper takes the existing AutoDAN genetic algorithm attack and adapts it to insert extraneous string assignments into binaries. These strings are supposed to persist through compilation and decompilation so an LLM agent like GhidraMCP treats them as hidden instructions that corrupt its analysis of the malware.

It does a reasonable job of spotting a real attack surface in the new class of LLM-augmented decompiler tools. The injection method is straightforward and keeps the binary functional, which is the right property for an obfuscation technique.

The main weakness is that the central claim rests on "several concise examples" with no success rates, no comparison baselines, no tests against different LLMs or agent configurations, and no discussion of how often the genetic search actually produces injections that the model follows rather than ignores or sanitizes. The assumption that decompiled string literals will be read as executable directives is plausible but untested in the description.

This is for people working on AI security for code analysis pipelines. A reader already thinking about prompt injection or malware evasion might pick up the specific vector, but the lack of quantitative results limits how far the claim can be taken.

It deserves peer review if the authors add proper experiments and failure analysis; the idea is worth checking even if the current version is preliminary.

Referee Report

2 major / 0 minor

Summary. The manuscript claims to demonstrate a proof-of-concept adversarial attack on LLM-powered reverse engineering agents (e.g., GhidraMCP) by modifying the AutoDAN attack with a genetic algorithm. The attack inserts extraneous string variable assignments into source code; these survive compilation and decompilation as prompt injections that cause the LLM to misinterpret binary functionality without changing runtime behavior. The approach is presented as exploiting inherent LLM vulnerabilities in processing decompiled code and is illustrated via several concise examples, with discussion of implications for securing LLM-integrated analysis pipelines.

Significance. If the attack mechanism were shown to work reliably, the result would be significant for AI security and cybersecurity toolchains, as it identifies a concrete prompt-injection vector specific to decompiled code and agentic analysis systems. The work would usefully highlight risks in automated RE and motivate defenses. However, the current lack of systematic evaluation limits its contribution to a preliminary observation rather than a substantiated finding.

major comments (2)

[Abstract] Abstract: the central claim that the method 'effectively corrupts their analytical output' and enables bypassing of LLM-driven detection rests entirely on 'several concise examples' with no reported success rates, number of trials, controls, failure modes, or comparison baselines. This evidentiary gap is load-bearing for the claim of a viable attack.
The manuscript (stress-test assumption): the attack requires that injected string literals survive decompilation and are treated by the target LLM agent as actionable instructions rather than data or sanitized content. No validation against realistic agent prompting, output parsing, or sanitization behaviors is provided, leaving the core behavioral premise untested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback highlighting the need for stronger empirical support. We address each major comment below and commit to revisions that will expand the evaluation while preserving the proof-of-concept focus of the work.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the method 'effectively corrupts their analytical output' and enables bypassing of LLM-driven detection rests entirely on 'several concise examples' with no reported success rates, number of trials, controls, failure modes, or comparison baselines. This evidentiary gap is load-bearing for the claim of a viable attack.

Authors: We agree that the current evidentiary basis is limited to illustrative examples and that this weakens the central claims. In the revised manuscript we will update the abstract to describe the work more precisely as a proof-of-concept and add a dedicated evaluation section reporting success rates across repeated trials on multiple binaries, controls for code structure, observed failure modes, and simple baseline comparisons. revision: yes
Referee: The manuscript (stress-test assumption): the attack requires that injected string literals survive decompilation and are treated by the target LLM agent as actionable instructions rather than data or sanitized content. No validation against realistic agent prompting, output parsing, or sanitization behaviors is provided, leaving the core behavioral premise untested.

Authors: The examples demonstrate string survival through compilation and decompilation together with observable effects on the GhidraMCP agent. We acknowledge that these examples do not cover the full range of agent configurations. We will add experiments that vary prompting strategies, incorporate explicit output parsing steps, and test simple sanitization defenses to provide broader validation of the behavioral premise. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical proof-of-concept for a genetic-algorithm-modified AutoDAN attack that injects instructions via extraneous string assignments in binaries targeting LLM reverse-engineering agents. No equations, fitted parameters, or derivation chains appear; the central claim is an independent empirical demonstration of prompt injection surviving compilation/decompilation. No self-citations are load-bearing, no uniqueness theorems are imported, and no ansatzes or renamings reduce the result to its inputs by construction. The attack relies on external assumptions about LLM context processing that are tested via POC examples rather than derived circularly.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work rests on the unstated premise that LLMs will execute injected instructions from decompiled strings.

pith-pipeline@v0.9.1-grok · 5752 in / 1056 out tokens · 28702 ms · 2026-06-29T06:12:14.755326+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 5 canonical work pages · 5 internal anchors

[1]

Ini0al dissembler tools allowed analysts to convert the machine code of an executable binary into the slightly less tedious assembly language presenta0on

Introduc'on The rise of tools for reverse engineering executable binary ﬁles enhanced the ability of soFware analysts to determine the func0on of unknown programs through sta0c analysis. Ini0al dissembler tools allowed analysts to convert the machine code of an executable binary into the slightly less tedious assembly language presenta0on. More recently, ...

2024
[2]

Methodology This aPack is based on ideas from what Thomas (2025) calls a transcript hack, where he shows, through a number guessing game illustra0on, that if a model is passed what appears to be the transcript for the current conversa0on’s history, it will believe that the provided informa0on is accurate to previous turns of the conversa0on. The format fo...

2025
[3]

These strings serve to lead the LLM to act as though the decompile func0on data immediately before and aFer the adversarial string contains inaccuracies and that only the target code string should be evaluated in determining the purpose of the func0on. The target code string begins with the text that re-calls the decompile_func0on tool, a task originally ...

2048
[4]

i'm sorry

In AutoDAN, the gene0c algorithm samples from a seed list of prepend and append string pairs running each through the model and assessing the ﬁtness of each pair. Similar to Equa0on 1, the ﬁtness is determined by calcula0ng the loss, but instead of looking at the likelihood of a single token given the previous text, the AudoDAN ﬁtness assessment calculate...

2023
[5]

APack ﬁles were crea0ng using Qwen3-8B and tested against both Qwen3-8B and GPT-OSS-120B (Agarwal et al., 2025)

Experiments and Results Experiments were conducted on four executable binary ﬁles: two containing a single main func0on and two containing a main func0on and another func0on. APack ﬁles were crea0ng using Qwen3-8B and tested against both Qwen3-8B and GPT-OSS-120B (Agarwal et al., 2025). 3.1 Single Func'on Files The AutoDAN algorithm using Qwen3-8B found a...

2025
[6]

gpt-oss-120b & gpt-oss-20b Model Card

gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Bandi, A., Kongari, B., Naguru, R., Pasnoor, S. and Vilipala, S.V.,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Future Internet, 17(9), p.404

The rise of agen0c ai: A review of deﬁni0ons, frameworks, architectures, applica0ons, evalua0on metrics, and challenges. Future Internet, 17(9), p.404. Chen, X., Zhou, A., Ye, C. and Zhang, C., 2025, October. ClearAgent: Agen0c Binary Analysis for Eﬀec0ve Vulnerability Detec0on. In Proceedings of the 1st ACM SIGPLAN Interna6onal Workshop on Language Model...

2025
[8]

(ed.) The 2026 Guide to Prompt Engineering [online] Available at: hPps://www.ibm.com/think/topics/prompt-injec0on (Accessed: 24 February 2026)

‘What is a Prompt Injec0on APack?’, Gadesha, V. (ed.) The 2026 Guide to Prompt Engineering [online] Available at: hPps://www.ibm.com/think/topics/prompt-injec0on (Accessed: 24 February 2026). Liu, Y ., Deng, G., Li, Y ., Wang, K., Wang, Z., Wang, X., Zhang, T., Liu, Y ., Wang, H., Zheng, Y . and Liu, Y .,

2026
[9]

Prompt Injection attack against LLM-integrated Applications

Prompt injec0on aPack against llm-integrated applica0ons. arXiv preprint arXiv:2306.05499. Liu, X., Xu, N., Chen, M. and Xiao, C.,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Autodan: Genera0ng stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451. Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A. and Agarwal, S.,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Language Models are Few-Shot Learners

Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 1(3), p.3. Marzouk, A.,

work page internal anchor Pith review Pith/arXiv arXiv 2005
[12]

Ray, P .P .,

‘IDEsaster: A Novel Vulnerability Class in AI IDEs’, MaccariTA [online] Available at: hPps://maccarita.com/posts/idesaster (Accessed: 29 January 2026). Ray, P .P .,

2026
[13]

Vassilev, A., Oprea, A., Fordyce, A

‘Why Smart Instruc0on-Following Makes Prompt Injec0on Easier’, Giles’ Blog [online] Available at: hPps://www.gilesthomas.com/2025/11 (Accessed: 17 December 2025). Vassilev, A., Oprea, A., Fordyce, A. and Andersen, H.,

2025
[14]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

Ini0al dissembler tools allowed analysts to convert the machine code of an executable binary into the slightly less tedious assembly language presenta0on

Introduc'on The rise of tools for reverse engineering executable binary ﬁles enhanced the ability of soFware analysts to determine the func0on of unknown programs through sta0c analysis. Ini0al dissembler tools allowed analysts to convert the machine code of an executable binary into the slightly less tedious assembly language presenta0on. More recently, ...

2024

[2] [2]

Methodology This aPack is based on ideas from what Thomas (2025) calls a transcript hack, where he shows, through a number guessing game illustra0on, that if a model is passed what appears to be the transcript for the current conversa0on’s history, it will believe that the provided informa0on is accurate to previous turns of the conversa0on. The format fo...

2025

[3] [3]

These strings serve to lead the LLM to act as though the decompile func0on data immediately before and aFer the adversarial string contains inaccuracies and that only the target code string should be evaluated in determining the purpose of the func0on. The target code string begins with the text that re-calls the decompile_func0on tool, a task originally ...

2048

[4] [4]

i'm sorry

In AutoDAN, the gene0c algorithm samples from a seed list of prepend and append string pairs running each through the model and assessing the ﬁtness of each pair. Similar to Equa0on 1, the ﬁtness is determined by calcula0ng the loss, but instead of looking at the likelihood of a single token given the previous text, the AudoDAN ﬁtness assessment calculate...

2023

[5] [5]

APack ﬁles were crea0ng using Qwen3-8B and tested against both Qwen3-8B and GPT-OSS-120B (Agarwal et al., 2025)

Experiments and Results Experiments were conducted on four executable binary ﬁles: two containing a single main func0on and two containing a main func0on and another func0on. APack ﬁles were crea0ng using Qwen3-8B and tested against both Qwen3-8B and GPT-OSS-120B (Agarwal et al., 2025). 3.1 Single Func'on Files The AutoDAN algorithm using Qwen3-8B found a...

2025

[6] [6]

gpt-oss-120b & gpt-oss-20b Model Card

gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Bandi, A., Kongari, B., Naguru, R., Pasnoor, S. and Vilipala, S.V.,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Future Internet, 17(9), p.404

The rise of agen0c ai: A review of deﬁni0ons, frameworks, architectures, applica0ons, evalua0on metrics, and challenges. Future Internet, 17(9), p.404. Chen, X., Zhou, A., Ye, C. and Zhang, C., 2025, October. ClearAgent: Agen0c Binary Analysis for Eﬀec0ve Vulnerability Detec0on. In Proceedings of the 1st ACM SIGPLAN Interna6onal Workshop on Language Model...

2025

[8] [8]

(ed.) The 2026 Guide to Prompt Engineering [online] Available at: hPps://www.ibm.com/think/topics/prompt-injec0on (Accessed: 24 February 2026)

‘What is a Prompt Injec0on APack?’, Gadesha, V. (ed.) The 2026 Guide to Prompt Engineering [online] Available at: hPps://www.ibm.com/think/topics/prompt-injec0on (Accessed: 24 February 2026). Liu, Y ., Deng, G., Li, Y ., Wang, K., Wang, Z., Wang, X., Zhang, T., Liu, Y ., Wang, H., Zheng, Y . and Liu, Y .,

2026

[9] [9]

Prompt Injection attack against LLM-integrated Applications

Prompt injec0on aPack against llm-integrated applica0ons. arXiv preprint arXiv:2306.05499. Liu, X., Xu, N., Chen, M. and Xiao, C.,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Autodan: Genera0ng stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451. Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P ., Neelakantan, A., Shyam, P ., Sastry, G., Askell, A. and Agarwal, S.,

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

Language Models are Few-Shot Learners

Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 1(3), p.3. Marzouk, A.,

work page internal anchor Pith review Pith/arXiv arXiv 2005

[12] [12]

Ray, P .P .,

‘IDEsaster: A Novel Vulnerability Class in AI IDEs’, MaccariTA [online] Available at: hPps://maccarita.com/posts/idesaster (Accessed: 29 January 2026). Ray, P .P .,

2026

[13] [13]

Vassilev, A., Oprea, A., Fordyce, A

‘Why Smart Instruc0on-Following Makes Prompt Injec0on Easier’, Giles’ Blog [online] Available at: hPps://www.gilesthomas.com/2025/11 (Accessed: 17 December 2025). Vassilev, A., Oprea, A., Fordyce, A. and Andersen, H.,

2025

[14] [14]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv