pith · machine review for the scientific record

arxiv: 2604.08083 · v1 · submitted 2026-04-09 · 💻 cs.SE

Recognition: 2 theorem links


Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3

classification 💻 cs.SE
keywords: binary deobfuscation · large language models · pseudocode recovery · reverse engineering · benchmark · fine-tuning · reasoning models

The pith

LLM deobfuscation of binaries succeeds through reasoning ability and task-specific fine-tuning rather than model scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BinDeObfBench, a benchmark that tests large language models on recovering pseudocode from binaries obfuscated at pre-compilation, compile-time, and post-compilation stages. It evaluates multiple models and finds that performance tracks more closely with reasoning strength and domain-specific training than with overall parameter count. Reasoning-oriented models remain stable even when obfuscation is extreme and transfer across different instruction set architectures and optimization levels. In-context examples help ordinary models more than reasoning ones, while supervised fine-tuning on the deobfuscation task consistently beats reliance on broad pre-training.

Core claim

The central claim is that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, that task-specific supervised fine-tuning consistently outperforms broad domain pre-training, and that reasoning models maintain robustness under severe obfuscation while generalizing across instruction set architectures and optimization levels. In-context learning benefits standard models but yields limited gains for reasoning models.

What carries the argument

BinDeObfBench, a benchmark spanning pre-compilation, compile-time, and post-compilation obfuscation transformations to measure pseudocode recovery quality from binaries.
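The per-stage scoring such a benchmark implies can be pictured with a minimal harness. The sketch below is illustrative only: `Sample`, `score_by_stage`, and the use of `difflib.SequenceMatcher` as the similarity metric are assumptions, not BinDeObfBench's actual implementation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Sample:
    stage: str       # "pre-compilation" | "compile-time" | "post-compilation"
    obfuscated: str  # decompiler pseudocode of the obfuscated binary
    reference: str   # pseudocode of the unobfuscated build


def lexical_similarity(candidate: str, reference: str) -> float:
    # Stand-in for the benchmark's lexical-consistency metric:
    # token-level similarity in [0, 1].
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


def score_by_stage(samples, deobfuscate):
    # Average recovery quality per obfuscation stage, given a
    # model-backed `deobfuscate(pseudocode) -> pseudocode` callable.
    totals, counts = {}, {}
    for s in samples:
        sim = lexical_similarity(deobfuscate(s.obfuscated), s.reference)
        totals[s.stage] = totals.get(s.stage, 0.0) + sim
        counts[s.stage] = counts.get(s.stage, 0) + 1
    return {stage: totals[stage] / counts[stage] for stage in totals}
```

A real harness would swap in the benchmark's own lexical and semantic metrics and aggregate over models as well as stages.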

If this is right

  • Reasoning models maintain high performance even under severe obfuscation.
  • Models generalize across different instruction set architectures and optimization levels.
  • Task-specific supervised fine-tuning produces better results than broad pre-training alone.
  • In-context learning improves standard models more than reasoning models.
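The in-context-learning finding in the last bullet concerns how worked examples are prepended to the prompt. A minimal few-shot prompt builder might look like the following; the function name and prompt wording are hypothetical, not taken from the paper.

```python
def build_fewshot_prompt(examples, target, k=2):
    """Assemble a k-shot deobfuscation prompt.

    examples: list of (obfuscated_pseudocode, clean_pseudocode) pairs.
    target: obfuscated pseudocode the model should recover.
    """
    parts = ["Recover readable pseudocode from the obfuscated input."]
    for obf, clean in examples[:k]:
        # Each in-context example pairs an obfuscated function
        # with its known clean recovery.
        parts.append(f"Obfuscated:\n{obf}\nRecovered:\n{clean}")
    # The target ends with an open "Recovered:" slot for the model to fill.
    parts.append(f"Obfuscated:\n{target}\nRecovered:")
    return "\n\n".join(parts)
```

Under the paper's finding, raising `k` should help standard models noticeably more than reasoning models.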

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Security tool developers may achieve stronger results by fine-tuning smaller reasoning models on deobfuscation data instead of scaling general models.
  • The benchmark offers a reusable testbed for measuring future progress in LLM-assisted reverse engineering.
  • Extending the benchmark to real-world malware binaries could expose gaps not captured by synthetic transformations.

Load-bearing premise

The selected obfuscation transformations and automatic metrics for pseudocode quality sufficiently represent real-world reverse-engineering difficulty and success criteria.
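If the benchmark blends metrics the way its Figure 3 ("Grid search for optimal α") suggests, the combination could be a weighted sum tuned against some validation signal. The sketch below assumes a blend of lexical and semantic scores tuned to correlate with hypothetical human ratings; every name and the tuning objective are assumptions, not the paper's definition.

```python
def pearson(xs, ys):
    # Plain Pearson correlation; returns 0.0 for a constant series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy)


def combined_score(lexical, semantic, alpha):
    # Weighted blend of two automatic metrics, both assumed in [0, 1].
    return alpha * lexical + (1 - alpha) * semantic


def best_alpha(lexical_scores, semantic_scores, human_ratings, steps=101):
    # Grid search: pick the alpha whose blended score correlates best
    # with (hypothetical) human judgments on a validation subset.
    best, best_r = 0.0, float("-inf")
    for i in range(steps):
        a = i / (steps - 1)
        blended = [combined_score(l, s, a)
                   for l, s in zip(lexical_scores, semantic_scores)]
        r = pearson(blended, human_ratings)
        if r > best_r:
            best, best_r = a, r
    return best
```

The load-bearing question is whether any such automatic objective tracks what human reverse engineers actually care about.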

What would settle it

A controlled test in which a much larger general-purpose model without task-specific fine-tuning achieves higher pseudocode similarity scores than fine-tuned reasoning models on the same obfuscated binaries would challenge the central finding.

Figures

Figures reproduced from arXiv: 2604.08083 by David Lo, Gangyang Li, Jieke Shi, Junqi Zhang, Li Hu, Shaoyin Cheng, Weiming Zhang, Xiuwei Shang, Zhou Yang.

Figure 1. Example of a bubble sort function in three representations: (top-left) source code, (bottom-left) pseudocode … view at source ↗
Figure 2. Workflow of BinDeObfBench. view at source ↗
Figure 3. Grid search for optimal α. view at source ↗
Figure 4. Performance under Different Obfuscation Levels. view at source ↗
Figure 5. Performance across Different Architectures and Optimizations. view at source ↗
Figure 6. Performance of LLMs and Non-LLM Deobfuscation Methods on Malware Dataset. view at source ↗
Figure 7. Deobfuscation Performance with Respect to Pseudocode Length and Symbolic Information. view at source ↗
Original abstract

Deobfuscating binary code remains a fundamental challenge in reverse engineering, as obfuscation is widely used to hinder analysis and conceal program logic. Although large language models (LLMs) have shown promise in recovering semantics from obfuscated binaries, a systematic evaluation of their effectiveness is still lacking. In this work, we present BinDeObfBench, the first comprehensive benchmark for assessing LLM-based binary deobfuscation across diverse transformations spanning pre-compilation, compile-time, and post-compilation stages. Our evaluation shows that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, and that task-specific supervised fine-tuning consistently outperforms broad domain pre-training. Reasoning models can maintain robustness under severe obfuscation, generalize across different instruction set architectures (ISAs) and optimization levels. In-context learning benefits standard models but yields limited gains for reasoning models. Overall, our study highlights the importance of task-specific fine-tuning and reasoning-driven strategies, and positions BinDeObfBench as a basis for future work in binary deobfuscation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BinDeObfBench, the first comprehensive benchmark for evaluating LLMs on deobfuscating binary code to pseudocode across pre-compilation, compile-time, and post-compilation obfuscation transformations. It reports that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, that task-specific supervised fine-tuning (SFT) outperforms broad domain pre-training, and that reasoning models maintain robustness under severe obfuscation while generalizing across ISAs and optimization levels. In-context learning helps standard models but adds limited value for reasoning models.

Significance. If the results hold, the work is significant for establishing the first systematic benchmark in LLM-assisted binary deobfuscation, an area of practical importance in reverse engineering and security. The creation of BinDeObfBench itself provides a reusable resource for future studies. The empirical findings on reasoning vs. scale and the superiority of task-specific SFT offer actionable guidance for model selection and training strategies in this domain.

major comments (2)
  1. The evaluation relies on automatic metrics for pseudocode quality (e.g., similarity or semantic equivalence scores on BinDeObfBench) without any reported validation against human reverse-engineering judgments, such as accuracy in recovering control flow, identifying vulnerabilities, or time-to-understanding. This is load-bearing for the central claims about reasoning models, SFT advantages, and robustness, as superficial syntactic matches could inflate scores without reflecting practical utility.
  2. The generalization claims across ISAs and optimization levels (reported in the evaluation) require explicit details on data splits, potential leakage between training and test sets, and statistical tests for significance; without these, the reported robustness under severe obfuscation cannot be fully assessed for overfitting or selection bias.
minor comments (2)
  1. The abstract and introduction would benefit from a brief table summarizing the obfuscation transformations included in BinDeObfBench and the exact LLMs evaluated.
  2. Notation for the automatic metrics (e.g., how semantic equivalence is computed) should be defined more precisely in the methods section to allow replication.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and will incorporate revisions to enhance the transparency and validation of our evaluation methodology.

read point-by-point responses
  1. Referee: The evaluation relies on automatic metrics for pseudocode quality (e.g., similarity or semantic equivalence scores on BinDeObfBench) without any reported validation against human reverse-engineering judgments, such as accuracy in recovering control flow, identifying vulnerabilities, or time-to-understanding. This is load-bearing for the central claims about reasoning models, SFT advantages, and robustness, as superficial syntactic matches could inflate scores without reflecting practical utility.

    Authors: We agree that direct validation against human reverse-engineering judgments would strengthen the practical relevance of our claims. Our benchmark relies on established automatic metrics for semantic equivalence and similarity, which are widely used in code generation and decompilation literature. However, we acknowledge the potential for these metrics to overstate utility if they do not align with human assessments of control flow recovery or vulnerability identification. In the revised manuscript, we will add a dedicated limitations subsection discussing this gap and include results from a small-scale human evaluation study (involving 3-5 expert reverse engineers) on a stratified subset of BinDeObfBench. We will report correlations between automatic scores and human ratings for key dimensions such as understandability and control-flow accuracy to better support the advantages of reasoning models and task-specific SFT. revision: yes

  2. Referee: The generalization claims across ISAs and optimization levels (reported in the evaluation) require explicit details on data splits, potential leakage between training and test sets, and statistical tests for significance; without these, the reported robustness under severe obfuscation cannot be fully assessed for overfitting or selection bias.

    Authors: We appreciate this request for greater methodological transparency. The current manuscript describes the benchmark construction and evaluation protocol at a high level, but we agree that explicit details on splits, leakage prevention, and significance testing are needed to fully substantiate the generalization and robustness claims. In the revised version, we will expand the 'Dataset Construction' and 'Experimental Setup' sections to detail: (1) the train/test split methodology ensuring no program or obfuscation configuration overlap (via unique source identifiers and hash-based deduplication); (2) verification steps confirming absence of leakage; and (3) statistical significance tests (e.g., paired Wilcoxon signed-rank tests with Bonferroni correction) applied to performance differences across ISAs and optimization levels. These additions will enable readers to rigorously evaluate potential overfitting or selection bias. revision: yes
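The hash-based deduplication the rebuttal proposes can be sketched with the standard library alone. `leakage_free_split` and its whitespace normalization are hypothetical illustrations, not the paper's actual protocol.

```python
import hashlib


def normalize(src: str) -> str:
    # Collapse whitespace so formatting-only variants of the same
    # source hash identically.
    return " ".join(src.split())


def leakage_free_split(programs, test_fraction=0.2):
    # Deterministic hash-based assignment: every obfuscated variant of a
    # program follows its source digest, so no source program can appear
    # on both sides of the split.
    train, test = [], []
    for prog in programs:
        digest = hashlib.sha256(normalize(prog["source"]).encode()).digest()
        bucket = digest[0] / 255.0  # pseudo-uniform in [0, 1]
        (test if bucket < test_fraction else train).append(prog)
    return train, test
```

Statistical significance over such splits would then come from paired tests (the rebuttal names Wilcoxon signed-rank with Bonferroni correction) applied per ISA and optimization level.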

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark study

full rationale

This is an empirical benchmark paper introducing BinDeObfBench and reporting LLM evaluation results across obfuscation transformations, ISAs, and optimization levels. No mathematical derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Central claims rest on experimental scores from the defined benchmark rather than any self-referential fitting, renaming, or self-citation load-bearing step. The study is self-contained against its own externally specified transformations and automatic metrics; no patterns from the enumerated circularity kinds are exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claims rest on standard machine-learning evaluation practices and the assumption that the constructed benchmark faithfully represents the deobfuscation problem.

invented entities (1)
  • BinDeObfBench no independent evidence
    purpose: Comprehensive benchmark dataset and evaluation suite for LLM-based binary-to-pseudocode deobfuscation
    Newly constructed for this study to enable systematic testing across obfuscation stages.

pith-pipeline@v0.9.0 · 5515 in / 1021 out tokens · 37498 ms · 2026-05-10T17:59:27.340060+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 21 canonical work pages · 8 internal anchors

  1. [1]

    The obfuscation executive

    Kelly Heffner and Christian Collberg. The obfuscation executive. In International Conference on Information Security, pages 428–440. Springer, 2004

  2. [2]

    Protecting software through obfuscation: Can it keep pace with progress in code analysis?

    Sebastian Schrittwieser, Stefan Katzenbeisser, Johannes Kinder, Georg Merzdovnik, and Edgar Weippl. Protecting software through obfuscation: Can it keep pace with progress in code analysis? ACM Computing Surveys (CSUR), 49(1):1–37, 2016

  3. [3]

    Chosen-instruction attack against commercial code virtualization obfuscators

    Shijia Li, Chunfu Jia, Pengda Qiu, Qiyuan Chen, Jiang Ming, and Debin Gao. Chosen-instruction attack against commercial code virtualization obfuscators. In Proceedings of the 29th Network and Distributed System Security Symposium, 2022

  4. [4]

    Coat: Code obfuscation tool to evaluate the performance of code plagiarism detection tools

    Sangjun Ko, Jusop Choi, and Hyoungshick Kim. Coat: Code obfuscation tool to evaluate the performance of code plagiarism detection tools. In 2017 International Conference on Software Security and Assurance (ICSSA), pages 32–37. IEEE, 2017

  5. [5]

    Malware obfuscation techniques: A brief survey

    Ilsun You and Kangbin Yim. Malware obfuscation techniques: A brief survey. In 2010 International Conference on Broadband, Wireless Computing, Communication and Applications, pages 297–300. IEEE, 2010

  6. [6]

    Binaryai: Binary software composition analysis via intelligent binary source code matching

    Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, and Yuqun Zhang. Binaryai: Binary software composition analysis via intelligent binary source code matching. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024

  7. [7]

    Poster: E-graphs and equality saturation for term-rewriting in mba deobfuscation: An empirical study

    Seoksu Lee, Hyeongchang Jeon, and Eun-Sun Cho. Poster: E-graphs and equality saturation for term-rewriting in mba deobfuscation: An empirical study. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pages 4985–4987, 2024

  8. [8]

    Simplifying mixed boolean-arithmetic obfuscation by program synthesis and term rewriting

    Jaehyung Lee and Woosuk Lee. Simplifying mixed boolean-arithmetic obfuscation by program synthesis and term rewriting. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 2351–2365, 2023

  9. [9]

    A generic approach to automatic deobfuscation of executable code

    Babak Yadegari, Brian Johannesmeyer, Ben Whitely, and Saumya Debray. A generic approach to automatic deobfuscation of executable code. In 2015 IEEE Symposium on Security and Privacy, pages 674–691. IEEE, 2015

  10. [10]

    Sok: Automatic deobfuscation of virtualization-protected applications

    Patrick Kochberger, Sebastian Schrittwieser, Stefan Schweighofer, Peter Kieseberg, and Edgar Weippl. Sok: Automatic deobfuscation of virtualization-protected applications. In Proceedings of the 16th International Conference on Availability, Reliability and Security, pages 1–15, 2021

  11. [11]

    Control-flow deobfuscation using trace-informed compositional program synthesis

    Benjamin Mariano, Ziteng Wang, Shankara Pailoor, Christian Collberg, and Işıl Dillig. Control-flow deobfuscation using trace-informed compositional program synthesis. Proceedings of the ACM on Programming Languages, 8(OOPSLA2):2211–2241, 2024

  12. [12]

    MBA-Blast: Unveiling and simplifying mixed Boolean-Arithmetic obfuscation

    Binbin Liu, Junfu Shen, Jiang Ming, Qilong Zheng, Jing Li, and Dongpeng Xu. MBA-Blast: Unveiling and simplifying mixed Boolean-Arithmetic obfuscation. In 30th USENIX Security Symposium (USENIX Security 21), pages 1701–1718, 2021

  13. [13]

    Llm-based test-driven interactive code generation: User study and empirical evaluation

    Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. Llm-based test-driven interactive code generation: User study and empirical evaluation. IEEE Transactions on Software Engineering, 2024

  14. [14]

    Automated program repair in the era of large pre-trained language models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023

  15. [15]

    Llm-cloud complete: Leveraging cloud computing for efficient large language model-based code completion

    Mingxuan Zhang, Bo Yuan, Hanzhe Li, and Kangming Xu. Llm-cloud complete: Leveraging cloud computing for efficient large language model-based code completion. Journal of Artificial Intelligence General Science (JAIGS), ISSN: 3006-4023, 5(1):295–326, 2024

  16. [16]

    Degpt: Optimizing decompiler output with llm

    Peiwei Hu, Ruigang Liang, and Kai Chen. Degpt: Optimizing decompiler output with llm. In Proceedings of the 2024 Network and Distributed System Security Symposium, 2024

  17. [17]

    How far have we gone in binary code understanding using large language models

    Xiuwei Shang, Shaoyin Cheng, Guoqiang Chen, Yanming Zhang, Li Hu, Xiao Yu, Gangyang Li, Weiming Zhang, and Nenghai Yu. How far have we gone in binary code understanding using large language models. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 1–12. IEEE, 2024

  18. [18]

    Binmetric: A comprehensive binary analysis benchmark for large language models

    Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, and Nenghai Yu. Binmetric: A comprehensive binary analysis benchmark for large language models. arXiv preprint arXiv:2505.07360, 2025

  19. [19]

    Llm4decompile: Decompiling binary code with large language models

    Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. Llm4decompile: Decompiling binary code with large language models. arXiv preprint arXiv:2403.05286, 2024

  20. [20]

    Beyond classification: Inferring function names in stripped binaries via domain adapted llms

    Linxi Jiang, Xin Jin, and Zhiqiang Lin. Beyond classification: Inferring function names in stripped binaries via domain adapted llms. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025

  21. [21]

    Misum: Multi-modality heterogeneous code graph learning for multi-intent binary code summarization

    Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao. Misum: Multi-modality heterogeneous code graph learning for multi-intent binary code summarization. Proceedings of the ACM on Software Engineering, 2(FSE):1339–1362, 2025

  22. [22]

    Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models

    Xin Jin, Jonathan Larson, Weiwei Yang, and Zhiqiang Lin. Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models. arXiv preprint arXiv:2312.09601, 2023

  23. [23]

    Typeforge: Synthesizing and selecting best-fit composite data types for stripped binaries

    Yanzhong Wang, Ruigang Liang, Yilin Li, Peiwei Hu, Kai Chen, and Bolun Zhang. Typeforge: Synthesizing and selecting best-fit composite data types for stripped binaries. In 2025 IEEE Symposium on Security and Privacy (SP), pages 1–18. IEEE, 2025

  24. [24]

    Qwen2.5-Coder technical report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  25. [25]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  26. [26]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  27. [27]

    Introducing Llama 3.1: Our most capable models to date

    Meta AI. Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/, 2025

  28. [28]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  29. [29]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/, 2024

  30. [30]

    OpenAI. https://openai.com/o1/, 2024

  31. [31]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  32. [32]

    Recopilot: Reverse engineering copilot in binary analysis

    Guoqiang Chen, Huiqi Sun, Daguang Liu, Zhiqi Wang, Qiang Wang, Bin Yin, Lu Liu, and Lingyun Ying. Recopilot: Reverse engineering copilot in binary analysis. arXiv preprint arXiv:2505.16366, 2025

  33. [33]

    Chatdeob: An effective deobfuscation method based on large language model

    Byunggeon Choi, Hongjoo Jin, Dong Hoon Lee, and Wonsuk Choi. Chatdeob: An effective deobfuscation method based on large language model. In International Conference on Information Security Applications, pages 151–163. Springer, 2024

  34. [34]

    D810. https://github.com/joydo/d810

    joydo. D810. https://github.com/joydo/d810

  35. [35]

    Goomba. https://hex-rays.com/blog/deobfuscation-with-goomba

    Hex-Rays. Goomba. https://hex-rays.com/blog/deobfuscation-with-goomba

  36. [36]

    Analysis of obfuscation transformations on binary code

    Matthieu Tofighi Shirazi. Analysis of obfuscation transformations on binary code. PhD thesis, Université Grenoble Alpes, 2019

  37. [37]

    Defeating opaque predicates statically through machine learning and binary analysis

    Ramtine Tofighi-Shirazi, Irina-Mariuca Asavoae, Philippe Elbaz-Vincent, and Thanh-Ha Le. Defeating opaque predicates statically through machine learning and binary analysis. In Proceedings of the 3rd ACM Workshop on Software Protection, pages 3–14, 2019

  38. [38]

    X-mba: Towards heterogeneous mixed boolean-arithmetic deobfuscation

    Gengwang Li, Min Yu, Dongliang Fang, Gang Li, Xiang Meng, Jiangguo Jiang, and Weiqing Huang. X-mba: Towards heterogeneous mixed boolean-arithmetic deobfuscation. In MILCOM 2024 - 2024 IEEE Military Communications Conference (MILCOM), pages 1082–1087. IEEE, 2024

  39. [39]

    Dose: Deobfuscation based on semantic equivalence

    Ramtine Tofighi-Shirazi, Maria Christofi, Philippe Elbaz-Vincent, and Thanh-Ha Le. Dose: Deobfuscation based on semantic equivalence. In Proceedings of the 8th Software Security, Protection, and Reverse Engineering Workshop, pages 1–12, 2018

  40. [40]

    Search-based local black-box deobfuscation: understand, improve and mitigate

    Grégoire Menguy, Sébastien Bardin, Richard Bonichon, and Cauim de Souza Lima. Search-based local black-box deobfuscation: understand, improve and mitigate. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 2513–2525, 2021

  41. [41]

    Input-output example-guided data deobfuscation on binary

    Yujie Zhao, Zhanyong Tang, Guixin Ye, Xiaoqing Gong, and Dingyi Fang. Input-output example-guided data deobfuscation on binary. Security and Communication Networks, 2021(1):4646048, 2021

  42. [42]

    Qsynth-a program synthesis based approach for binary code deobfuscation

    Robin David, Luigi Coniglio, Mariano Ceccato, et al. Qsynth-a program synthesis based approach for binary code deobfuscation. In BAR 2020 Workshop, 2020

  43. [43]

    Syntia: Synthesizing the semantics of obfuscated code

    Tim Blazytko, Moritz Contag, Cornelius Aschermann, and Thorsten Holz. Syntia: Synthesizing the semantics of obfuscated code. In 26th USENIX Security Symposium (USENIX Security 17), pages 643–659, 2017

  44. [44]

    Seead: A semantic-based approach for automatic binary code de-obfuscation

    Zhanyong Tang, Kaiyuan Kuang, Lei Wang, Chao Xue, Xiaoqing Gong, Xiaojiang Chen, Dingyi Fang, Jie Liu, and Zheng Wang. Seead: A semantic-based approach for automatic binary code de-obfuscation. In 2017 IEEE Trustcom/BigDataSE/ICESS, pages 261–268. IEEE, 2017

  45. [45]

    Exploring the potential of llms for code deobfuscation

    David Beste, Grégoire Menguy, Hossein Hajipour, Mario Fritz, Antonio Emanuele Cinà, Sébastien Bardin, Thorsten Holz, Thorsten Eisenhofer, and Lea Schönherr. Exploring the potential of llms for code deobfuscation. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 267–286. Springer, 2025

  46. [46]

    Dobf: A deobfuscation pre-training objective for programming languages

    Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, and Guillaume Lample. Dobf: A deobfuscation pre-training objective for programming languages. Advances in Neural Information Processing Systems, 34:14967–14979, 2021

  47. [47]

    Alfredo: Agentic llm-based framework for code deobfuscation

    Ching Yuhui Natalie, Sophie Tung Xuan Ying, and Siow Jing Kai. Alfredo: Agentic llm-based framework for code deobfuscation

  48. [48]

    Can llms obfuscate code? a systematic analysis of large language models into assembly code obfuscation

    Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ketu Ndawula, Sriram Vema, Edward Raff, and Manas Gaur. Can llms obfuscate code? a systematic analysis of large language models into assembly code obfuscation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24893–24901, 2025

  49. [49]

    Deconstructing obfuscation: A four-dimensional framework for evaluating large language models assembly code deobfuscation capabilities

    Anton Tkachenko, Dmitrij Suskevic, and Benjamin Adolphi. Deconstructing obfuscation: A four-dimensional framework for evaluating large language models assembly code deobfuscation capabilities. arXiv preprint arXiv:2505.19887, 2025

  50. [50]

    Enabling obfuscation detection in binary software through explainable ai

    Claudia Greco, Michele Ianni, Antonella Guzzo, and Giancarlo Fortino. Enabling obfuscation detection in binary software through explainable ai. IEEE Transactions on Emerging Topics in Computing, 2024

  51. [51]

    Debra: A real-world benchmark for evaluating deobfuscation methods

    Zheyun Feng and Dongpeng Xu. Debra: A real-world benchmark for evaluating deobfuscation methods. In Proceedings of the 2025 Workshop on Software Understanding and Reverse Engineering, pages 76–88, 2025

  52. [52]

    Predicting the resilience of obfuscated code against symbolic execution attacks via machine learning

    B Sebastian, C Christian, and P Alexander. Predicting the resilience of obfuscated code against symbolic execution attacks via machine learning. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, pages 16–18, 2017

  53. [53]

    Mibench: A free, commercially representative embedded benchmark suite

    Matthew R Guthaus, Jeffrey S Ringenberg, Dan Ernst, Todd M Austin, Trevor Mudge, and Richard B Brown. Mibench: A free, commercially representative embedded benchmark suite. In Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No. 01EX538), pages 3–14. IEEE, 2001

  54. [54]

    Spec cpu2006 benchmark descriptions

    John L Henning. Spec cpu2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006

  55. [55]

    An empirical study on the effectiveness of large language models for binary code understanding

    Xiuwei Shang, Zhenkan Fu, Shaoyin Cheng, Guoqiang Chen, Gangyang Li, Li Hu, Weiming Zhang, and Nenghai Yu. An empirical study on the effectiveness of large language models for binary code understanding. arXiv preprint arXiv:2504.21803, 2025

  56. [56]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  57. [57]

    Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks

    Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655, 2021

  58. [58]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  59. [59]

    Evalplus leaderboard. https://evalplus.github.io/leaderboard.html, 2024

  60. [60]

    Codexglue: A machine learning benchmark dataset for code understanding and generation

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021

  61. [61]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019

  62. [62]

    Ollvm. https://github.com/obfuscator-llvm/obfuscator

    obfuscator-llvm. Ollvm. https://github.com/obfuscator-llvm/obfuscator

  63. [63]

    Hikari. https://github.com/HikariObfuscator/Hikari

    HikariObfuscator. Hikari. https://github.com/HikariObfuscator/Hikari

  64. [64]

    Tigress. https://tigress.wtf

    University of Arizona. Tigress. https://tigress.wtf

  65. [65]

    Alcatraz. https://github.com/weak1337/Alcatraz

    weak1337. Alcatraz. https://github.com/weak1337/Alcatraz

  66. [66]

    Loop: Logic-oriented opaque predicate detection in obfuscated binary code

    Jiang Ming, Dongpeng Xu, Li Wang, and Dinghao Wu. Loop: Logic-oriented opaque predicate detection in obfuscated binary code. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 757–768, 2015

  67. [67]

    Measuring nominal scale agreement among many raters

    Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378, 1971

  68. [68]

    thezoo. https://github.com/ytisf/theZoo

    ytisf. thezoo. https://github.com/ytisf/theZoo

  69. [69]

    Malwaresourcecode. https://github.com/vxunderground/MalwareSourceCode

    vxunderground. Malwaresourcecode. https://github.com/vxunderground/MalwareSourceCode

  70. [70]

    A survey of large language models for code: Evolution, benchmarking, and future trends

    Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372, 2023

  71. [71]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  72. [72]

    OpenAI. https://platform.openai.com/docs/models/gpt-3.5-turbo, 2023

  73. [73]

    Llm2vec: Large language models are secretly powerful text encoders

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024

  74. [74]

    When text embedding meets large language model: a comprehensive survey

    Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, and Richong Zhang. When text embedding meets large language model: a comprehensive survey. arXiv preprint arXiv:2412.09165, 2024

  75. [75]

    Codexembed: A generalist embedding model family for multilingual and multi-task code retrieval

    Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Codexembed: A generalist embedding model family for multilingual and multi-task code retrieval. arXiv preprint arXiv:2411.12644, 2024

  76. [76]

    Efficient code embeddings from code generation models

    Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, and Han Xiao. Efficient code embeddings from code generation models. arXiv preprint arXiv:2508.21290, 2025

  77. [77]

    Qwen2 technical report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  78. [78]

    Mining of massive data sets

    Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive data sets. Cambridge University Press, 2020

  79. [79]

    The dimensionality of program complexity

    John C Munson and Taghi M Khoshgoftaar. The dimensionality of program complexity. In Proceedings of the 11th international conference on Software engineering, pages 245–253, 1989

  80. [80]

    Software complexity analysis using halstead metrics

    T Hariprasad, G Vidhyagaran, K Seenu, and Chandrasegar Thirumalai. Software complexity analysis using halstead metrics. In 2017 international conference on trends in electronics and informatics (ICEI), pages 1109–1113. IEEE, 2017

Showing first 80 references.