pith. machine review for the scientific record.

arxiv: 2604.14317 · v1 · submitted 2026-04-15 · 💻 cs.CR · cs.AI

Recognition: unknown

Challenges and Future Directions in Agentic Reverse Engineering Systems

Pith reviewed 2026-05-10 12:46 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords agentic systems · reverse engineering · binary analysis · LLM agents · security challenges · obfuscation · future directions

The pith

Agentic systems for binary reverse engineering still struggle with obfuscation, timing, and unique architectures despite recent advances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language model-based agentic systems perform on reverse engineering tasks using static, dynamic, and hybrid approaches. It identifies key limitations such as token constraints, difficulty handling obfuscated code, and the absence of program guardrails. A sympathetic reader would care because these systems are increasingly applied to security-critical tasks like binary analysis, and understanding where they fail shows what must improve before they can be relied on in real-world settings. The authors outline future directions for overcoming these limitations from a security perspective.

Core claim

Through analysis of existing agentic tool usage in reverse engineering, the paper finds that cutting-edge systems continue to fail in complex scenarios involving obfuscation, timing, and unique architectures. The examination covers static, dynamic, and hybrid agents and highlights limitations including token constraints, struggles with obfuscation, and a lack of program guardrails, leading to outlined challenges and future directions for system designers.

What carries the argument

Analysis of agentic tool usage across static, dynamic, and hybrid agents for binary reverse engineering tasks.

Load-bearing premise

The analysis of existing agentic tool usage captures the primary and representative limitations across realistic reverse engineering settings.

What would settle it

Demonstration of an agentic system that successfully performs reverse engineering on obfuscated binaries with unique architectures without hitting token limits or requiring manual guardrails.

Figures

Figures reproduced from arXiv: 2604.14317 by Jack West, Kassem Fawaz, Salem Radey.

Figure 1: An overview of agent capabilities by type.

Original abstract

Agentic systems built on large language models (LLMs) are increasingly being used for complex security tasks, including binary reverse engineering (RE). Despite recent growth in popularity and capability, these systems continue to face limitations in realistic settings. Cutting-edge systems still fail in complex RE scenarios that involve obfuscation, timing, and unique architecture. In this work, we examine how agentic systems perform reverse engineering tasks with static, dynamic, and hybrid agents. Through an analysis of existing agentic tool usage, we identify several limitations, including token constraints, struggles with obfuscation, and a lack of program guardrails. From these findings, we outline current challenges and position future directions for system designers to overcome from a security perspective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper is a position piece that analyzes the use of LLM-based agentic systems for binary reverse engineering tasks via static, dynamic, and hybrid agent approaches. Drawing on a qualitative review of existing tool usage, it identifies limitations such as token constraints, struggles with code obfuscation, timing dependencies, unique architectures, and insufficient program guardrails. These observations motivate a discussion of current challenges and proposed future directions for more secure and effective agentic RE systems.

Significance. If the limitations identified are broadly representative, the paper provides a timely synthesis of gaps in an emerging area at the intersection of AI and security. Its value lies in framing concrete challenges (obfuscation handling, guardrails) as motivation for future work rather than claiming new empirical results; this can help guide system designers toward more robust designs. The observational approach is appropriate for a position paper and avoids overclaiming.

minor comments (3)
  1. The abstract and introduction would benefit from a brief statement of the scope and selection criteria for the 'existing agentic tool usage' reviewed, to allow readers to evaluate potential selection bias in the identified limitations.
  2. Claims about failures in scenarios involving obfuscation, timing, and unique architectures are central but presented at a high level; adding one or two concrete, cited examples from the reviewed systems would strengthen the motivation without altering the position-piece nature.
  3. The future-directions section could more explicitly link each proposed direction back to the specific limitations enumerated earlier, improving traceability for readers.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our position paper and for recommending minor revision. We appreciate the recognition that the observational approach is appropriate for this type of work and that the synthesis of limitations can help guide future system design.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an observational position paper whose central claims derive from a qualitative review of external agentic RE tools and literature. Limitations (token constraints, obfuscation struggles, missing guardrails) are listed as direct observations from that review rather than from any fitted parameters, self-referential predictions, or equations. No derivation chain, uniqueness theorem, or ansatz is invoked; future directions follow logically from the enumerated challenges without requiring the analysis to be exhaustive or statistically representative. The paper contains no self-citation load-bearing steps or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on domain assumptions about the current state of LLM agent capabilities in security tasks, with no free parameters, invented entities, or ad-hoc axioms introduced beyond standard expectations for agentic systems.

axioms (1)
  • domain assumption: LLM-based agents can be meaningfully evaluated on reverse engineering tasks via static, dynamic, and hybrid modes.
    Invoked in the abstract when describing the examination of agent performance.

pith-pipeline@v0.9.0 · 5412 in / 1181 out tokens · 31284 ms · 2026-05-10T12:46:34.483416+00:00 · methodology

