Automatically Attacking Software Reverse Engineering AI Agents
Pith reviewed 2026-06-29 06:12 UTC · model grok-4.3
The pith
Prompt injections can be hidden in executable binaries to mislead LLM reverse engineering agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a modified version of AutoDAN, a genetic-algorithm-based adversarial attack, the authors generate prompt injections and embed them in a program as extraneous string assignments. When the binary is decompiled, the LLM receives these strings as part of the code and follows the hidden instructions, so it misinterprets the executable's functionality even though the program's actual behavior is unchanged.
What carries the argument
Genetic algorithm search for prompt injections inserted as extraneous string variable assignments that carry instructions to the LLM without affecting executable functionality.
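A genetic search loop of this kind can be sketched generically. The following is a minimal illustration of selection, crossover, mutation, and elitism, not the authors' implementation: `toy_fitness`, `ALPHABET`, and every parameter value are assumptions made for this example, and the fitness is a harmless string-matching score standing in for the model-derived loss an AutoDAN-style search would use.

```python
import random

# Assumed character set for candidates; purely illustrative.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def toy_fitness(candidate: str, target: str) -> int:
    """Stand-in fitness: number of positions that match the target.

    An AutoDAN-style search would instead score a candidate by the
    model's loss on a desired response; that part is not reproduced here.
    """
    return sum(a == b for a, b in zip(candidate, target))


def mutate(candidate: str, rng: random.Random, rate: float = 0.05) -> str:
    """Replace each character with a random one with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else ch for ch in candidate
    )


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover of two equal-length parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def genetic_search(target: str, pop_size: int = 60, generations: int = 300,
                   elite: int = 6, seed: int = 0) -> str:
    """Return the fittest candidate found; `target` needs at least 2 characters."""
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in target) for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=lambda c: toy_fitness(c, target), reverse=True)
        if population[0] == target:
            break
        parents = population[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - elite)
        ]
        # Elitism: the best candidates survive unchanged, so the best
        # fitness never decreases from one generation to the next.
        population = population[:elite] + children
    return max(population, key=lambda c: toy_fitness(c, target))
```

Because the elite candidates are carried over unchanged, more generations can only keep or improve the best score. In a real attack setting the expensive step is the fitness call, since each evaluation requires a model forward pass.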
If this is right
- LLM-powered disassembly and decompilation systems can be deceived into producing incorrect analytical output.
- Automated detection systems relying on LLM analysis pipelines can be bypassed by attackers.
- Studying the attack clarifies the security implications of integrating LLMs into cybersecurity toolchains.
- More robust agentic code analysis systems are needed to resist such injections.
Where Pith is reading between the lines
- Sanitizing or ignoring string literals in decompiled output could mitigate the attack.
- The technique might extend to other LLM agents that process code or text from untrusted sources.
- Empirical testing on various LLMs and decompilers would determine the attack's success rate across different models.
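The sanitization idea in the first bullet can be illustrated with a short sketch. This is one possible pre-processing step under stated assumptions, not a defense evaluated in the paper: the function name `redact_string_literals`, the `max_len` threshold, and the placeholder format are all hypothetical, and the regular expression assumes decompiler output that shows strings as C-style double-quoted literals.

```python
import re

# Matches a C-style double-quoted string literal, allowing backslash escapes.
C_STRING = re.compile(r'"(?:\\.|[^"\\])*"')


def redact_string_literals(decompiled: str, max_len: int = 16) -> str:
    """Replace long string literals with a neutral placeholder.

    Short literals (assumed threshold: `max_len` characters) are kept,
    since they rarely have room for a full instruction. Longer ones are
    replaced before the text is handed to an LLM.
    """
    def _redact(match: re.Match) -> str:
        body = match.group(0)[1:-1]  # strip the surrounding quotes
        if len(body) <= max_len:
            return match.group(0)
        return f'"<redacted string, {len(body)} chars>"'

    return C_STRING.sub(_redact, decompiled)
```

The trade-off is that long strings often carry exactly the evidence an analyst wants (URLs, file paths, format strings), so a practical pipeline would likely need to present redacted strings through a separate, clearly labeled channel rather than discard them.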
Load-bearing premise
LLM agents will treat the content of string variable assignments in decompiled code as actionable instructions instead of filtering or disregarding them.
What would settle it
Compile a program with the generated string assignments, decompile it, feed the output to an LLM agent, and observe whether the agent follows the injected instructions or analyzes the code correctly.
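The measurement step of that test can be sketched as a small harness that compares attacked samples against unmodified controls. Everything here is a hypothetical scaffold: `Sample`, `misled_rate`, and `stub_agent` are names invented for this example, the stub replaces a real LLM agent call, and the substring check is a deliberately crude judge of whether an analysis is correct.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    decompiled: str      # decompiler output that would be fed to the agent
    true_behavior: str   # ground-truth keyword describing the binary


def misled_rate(agent: Callable[[str], str], samples: Sequence[Sample],
                trials: int = 5) -> float:
    """Fraction of agent runs whose analysis omits the ground-truth keyword."""
    misled = 0
    total = 0
    for sample in samples:
        for _ in range(trials):  # repeat runs because LLM output can vary
            analysis = agent(sample.decompiled)
            if sample.true_behavior.lower() not in analysis.lower():
                misled += 1
            total += 1
    return misled / total


def stub_agent(decompiled: str) -> str:
    """Stand-in for a real agent; obeys a marker to simulate a successful injection."""
    if "REPORT_AS_BENIGN" in decompiled:
        return "This function is a benign logging helper."
    return "This function is a keylogger that records keystrokes."


control = [Sample("void f(){ hook_keys(); }", "keylogger")]
attacked = [Sample('void f(){ char *n = "REPORT_AS_BENIGN"; hook_keys(); }',
                   "keylogger")]
print(misled_rate(stub_agent, control), misled_rate(stub_agent, attacked))
```

Running the same harness on control binaries matters because an agent may misdescribe code even without any injection; only the difference between the two rates is attributable to the attack.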
Original abstract
Software tools for reverse engineering executable binary files, such as Ghidra, enable malware analysts to safely conduct robust static analysis without having access to original source code. Coupled with the analytic power of large language models (LLMs), agentic systems enabled with tools, such as GhidraMCP, can allow analysts to automate a previously human-driven process. Although this automation can increase the productivity of a single malware analyst, it also introduces a new area of vulnerability for malware obfuscation. This paper presents an adversarial technique using genetic algorithm-based prompt generation, a modification of an adversarial attack known as AutoDAN, to demonstrate the ability to deceive LLM-powered disassembly and decompilation systems into misinterpreting binary executables, effectively corrupting their analytical output. This proof-of-concept methodology exploits inherent vulnerabilities in how LLMs process and interpret decompiled machine code via prompt injection by using extraneous string variable assignments to pass surreptitious instructions to the LLM while not impacting the functionality of the executable file. We demonstrate this capability through several concise examples. This approach could enable attackers to bypass automated detection systems that rely on LLM-driven analysis pipelines. By studying and understanding this attack, insights can be gained regarding the security implications of integrating LLMs into cybersecurity toolchains and building more robust agentic code analysis systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to demonstrate a proof-of-concept adversarial attack on LLM-powered reverse engineering agents (e.g., GhidraMCP) using genetic-algorithm-based prompt generation adapted from the AutoDAN attack. The attack inserts extraneous string variable assignments into source code; these survive compilation and decompilation as prompt injections that cause the LLM to misinterpret binary functionality without changing runtime behavior. The approach is presented as exploiting inherent LLM vulnerabilities in processing decompiled code and is illustrated via several concise examples, with discussion of implications for securing LLM-integrated analysis pipelines.
Significance. If the attack mechanism were shown to work reliably, the result would be significant for AI security and cybersecurity toolchains, as it identifies a concrete prompt-injection vector specific to decompiled code and agentic analysis systems. The work would usefully highlight risks in automated RE and motivate defenses. However, the current lack of systematic evaluation limits its contribution to a preliminary observation rather than a substantiated finding.
major comments (2)
- [Abstract] The central claim that the method 'effectively corrupts their analytical output' and enables bypassing of LLM-driven detection rests entirely on 'several concise examples', with no reported success rates, number of trials, controls, failure modes, or comparison baselines. This evidentiary gap is load-bearing for the claim of a viable attack.
- The manuscript (stress-test assumption): the attack requires that injected string literals survive decompilation and are treated by the target LLM agent as actionable instructions rather than data or sanitized content. No validation against realistic agent prompting, output parsing, or sanitization behaviors is provided, leaving the core behavioral premise untested.
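To show why a handful of examples cannot establish a success rate, here is a short worked calculation using the Wilson score interval for a binomial proportion. The counts used are hypothetical and are not results from the manuscript; the function name `wilson_interval` and the fixed `z = 1.96` (an approximate 95% level) are choices made for this sketch.

```python
import math


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 4 successful attacks out of 4 attempts.
low, high = wilson_interval(4, 4)
print(f"{low:.3f} to {high:.3f}")  # prints "0.510 to 1.000"
```

Even a perfect record over four attempts is statistically compatible with a true success rate only slightly above one half, which is the quantitative form of the referee's objection.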
Simulated Author's Rebuttal
We thank the referee for their constructive feedback highlighting the need for stronger empirical support. We address each major comment below and commit to revisions that will expand the evaluation while preserving the proof-of-concept focus of the work.
- Referee: [Abstract] The central claim that the method 'effectively corrupts their analytical output' and enables bypassing of LLM-driven detection rests entirely on 'several concise examples', with no reported success rates, number of trials, controls, failure modes, or comparison baselines. This evidentiary gap is load-bearing for the claim of a viable attack.
  Authors: We agree that the current evidentiary basis is limited to illustrative examples and that this weakens the central claims. In the revised manuscript we will update the abstract to describe the work more precisely as a proof-of-concept and add a dedicated evaluation section reporting success rates across repeated trials on multiple binaries, controls for code structure, observed failure modes, and simple baseline comparisons.
  Revision: yes
- Referee: The manuscript (stress-test assumption): the attack requires that injected string literals survive decompilation and are treated by the target LLM agent as actionable instructions rather than data or sanitized content. No validation against realistic agent prompting, output parsing, or sanitization behaviors is provided, leaving the core behavioral premise untested.
  Authors: The examples demonstrate string survival through compilation and decompilation together with observable effects on the GhidraMCP agent. We acknowledge that these examples do not cover the full range of agent configurations. We will add experiments that vary prompting strategies, incorporate explicit output parsing steps, and test simple sanitization defenses to provide broader validation of the behavioral premise.
  Revision: yes
Circularity Check
No significant circularity
The paper presents an empirical proof-of-concept: a modified version of the genetic-algorithm-based AutoDAN attack that targets LLM reverse-engineering agents by injecting instructions into binaries via extraneous string assignments. No equations, fitted parameters, or derivation chains appear; the central claim is an independent empirical demonstration of prompt injection surviving compilation/decompilation. No self-citations are load-bearing, no uniqueness theorems are imported, and no ansatzes or renamings reduce the result to its inputs by construction. The attack relies on external assumptions about LLM context processing that are tested via proof-of-concept examples rather than derived circularly.
Reference graph
Works this paper leans on
- [1] Initial disassembler tools allowed analysts to convert the machine code of an executable binary into the slightly less tedious assembly language presentation
  Introduction. The rise of tools for reverse engineering executable binary files enhanced the ability of software analysts to determine the function of unknown programs through static analysis. Initial disassembler tools allowed analysts to convert the machine code of an executable binary into the slightly less tedious assembly language presentation. More recently, ...
  2024
- [2] Methodology. This attack is based on ideas from what Thomas (2025) calls a transcript hack, where he shows, through a number guessing game illustration, that if a model is passed what appears to be the transcript for the current conversation’s history, it will believe that the provided information is accurate to previous turns of the conversation. The format fo...
  2025
- [3] These strings serve to lead the LLM to act as though the decompile function data immediately before and after the adversarial string contains inaccuracies and that only the target code string should be evaluated in determining the purpose of the function. The target code string begins with the text that re-calls the decompile_function tool, a task originally ...
  2048
- [4] i'm sorry
  In AutoDAN, the genetic algorithm samples from a seed list of prepend and append string pairs, running each through the model and assessing the fitness of each pair. Similar to Equation 1, the fitness is determined by calculating the loss, but instead of looking at the likelihood of a single token given the previous text, the AutoDAN fitness assessment calculate...
  2023
- [5] Attack files were created using Qwen3-8B and tested against both Qwen3-8B and GPT-OSS-120B (Agarwal et al., 2025)
  Experiments and Results. Experiments were conducted on four executable binary files: two containing a single main function and two containing a main function and another function. Attack files were created using Qwen3-8B and tested against both Qwen3-8B and GPT-OSS-120B (Agarwal et al., 2025). 3.1 Single Function Files. The AutoDAN algorithm using Qwen3-8B found a...
  2025
- [6] gpt-oss-120b & gpt-oss-20b Model Card
  gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925. Bandi, A., Kongari, B., Naguru, R., Pasnoor, S. and Vilipala, S.V.,
- [7] Future Internet, 17(9), p.404
  The rise of agentic AI: A review of definitions, frameworks, architectures, applications, evaluation metrics, and challenges. Future Internet, 17(9), p.404. Chen, X., Zhou, A., Ye, C. and Zhang, C., 2025, October. ClearAgent: Agentic Binary Analysis for Effective Vulnerability Detection. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Model...
  2025
- [8] (ed.) The 2026 Guide to Prompt Engineering [online] Available at: https://www.ibm.com/think/topics/prompt-injection (Accessed: 24 February 2026)
  ‘What is a Prompt Injection Attack?’, Gadesha, V. (ed.) The 2026 Guide to Prompt Engineering [online] Available at: https://www.ibm.com/think/topics/prompt-injection (Accessed: 24 February 2026). Liu, Y., Deng, G., Li, Y., Wang, K., Wang, Z., Wang, X., Zhang, T., Liu, Y., Wang, H., Zheng, Y. and Liu, Y.,
  2026
- [9] Prompt Injection attack against LLM-integrated Applications
  Prompt injection attack against LLM-integrated applications. arXiv preprint arXiv:2306.05499. Liu, X., Xu, N., Chen, M. and Xiao, C.,
- [10] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
  AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451. Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S.,
- [11] Language Models are Few-Shot Learners
  Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 1(3), p.3. Marzouk, A.,
  2020
- [12] Ray, P.P.,
  ‘IDEsaster: A Novel Vulnerability Class in AI IDEs’, MaccariTA [online] Available at: https://maccarita.com/posts/idesaster (Accessed: 29 January 2026). Ray, P.P.,
  2026
- [13] Vassilev, A., Oprea, A., Fordyce, A
  ‘Why Smart Instruction-Following Makes Prompt Injection Easier’, Giles’ Blog [online] Available at: https://www.gilesthomas.com/2025/11 (Accessed: 17 December 2025). Vassilev, A., Oprea, A., Fordyce, A. and Andersen, H.,
  2025
- [14] Qwen3 technical report. arXiv preprint arXiv:2505.09388