pith · machine review for the scientific record

arxiv: 2604.08083 · v1 · submitted 2026-04-09 · 💻 cs.SE

Recognition: 2 theorem links


Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3

classification 💻 cs.SE
keywords: binary deobfuscation · large language models · pseudocode recovery · reverse engineering · benchmark · fine-tuning · reasoning models

The pith

LLM deobfuscation of binaries succeeds through reasoning ability and task-specific fine-tuning rather than model scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BinDeObfBench, a benchmark that tests large language models on recovering pseudocode from binaries obfuscated at pre-compilation, compile-time, and post-compilation stages. It evaluates multiple models and finds that performance tracks more closely with reasoning strength and domain-specific training than with overall parameter count. Reasoning-oriented models remain stable even when obfuscation is extreme and transfer across different instruction set architectures and optimization levels. In-context examples help ordinary models more than reasoning ones, while supervised fine-tuning on the deobfuscation task consistently beats reliance on broad pre-training.

Core claim

The central claim is that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, that task-specific supervised fine-tuning consistently outperforms broad domain pre-training, and that reasoning models maintain robustness under severe obfuscation while generalizing across instruction set architectures and optimization levels. In-context learning benefits standard models but yields limited gains for reasoning models.

What carries the argument

BinDeObfBench, a benchmark spanning pre-compilation, compile-time, and post-compilation obfuscation transformations to measure pseudocode recovery quality from binaries.
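The per-stage scoring such a benchmark implies can be pictured with a minimal harness. The sketch below is illustrative only: `Sample`, `score_by_stage`, and the use of `difflib.SequenceMatcher` as the similarity metric are assumptions, not BinDeObfBench's actual implementation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Sample:
    stage: str       # "pre-compilation" | "compile-time" | "post-compilation"
    obfuscated: str  # decompiler pseudocode of the obfuscated binary
    reference: str   # pseudocode of the unobfuscated build


def lexical_similarity(candidate: str, reference: str) -> float:
    # Stand-in for the benchmark's lexical-consistency metric:
    # token-level similarity in [0, 1].
    return SequenceMatcher(None, candidate.split(), reference.split()).ratio()


def score_by_stage(samples, deobfuscate):
    # Average recovery quality per obfuscation stage, given a
    # model-backed `deobfuscate(pseudocode) -> pseudocode` callable.
    totals, counts = {}, {}
    for s in samples:
        sim = lexical_similarity(deobfuscate(s.obfuscated), s.reference)
        totals[s.stage] = totals.get(s.stage, 0.0) + sim
        counts[s.stage] = counts.get(s.stage, 0) + 1
    return {stage: totals[stage] / counts[stage] for stage in totals}
```

A real harness would swap in the benchmark's own lexical and semantic metrics and aggregate over models as well as stages.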

If this is right

  • Reasoning models maintain high performance even under severe obfuscation.
  • Models generalize across different instruction set architectures and optimization levels.
  • Task-specific supervised fine-tuning produces better results than broad pre-training alone.
  • In-context learning improves standard models more than reasoning models.
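The in-context-learning finding in the last bullet concerns how worked examples are prepended to the prompt. A minimal few-shot prompt builder might look like the following; the function name and prompt wording are hypothetical, not taken from the paper.

```python
def build_fewshot_prompt(examples, target, k=2):
    """Assemble a k-shot deobfuscation prompt.

    examples: list of (obfuscated_pseudocode, clean_pseudocode) pairs.
    target: obfuscated pseudocode the model should recover.
    """
    parts = ["Recover readable pseudocode from the obfuscated input."]
    for obf, clean in examples[:k]:
        # Each in-context example pairs an obfuscated function
        # with its known clean recovery.
        parts.append(f"Obfuscated:\n{obf}\nRecovered:\n{clean}")
    # The target ends with an open "Recovered:" slot for the model to fill.
    parts.append(f"Obfuscated:\n{target}\nRecovered:")
    return "\n\n".join(parts)
```

Under the paper's finding, raising `k` should help standard models noticeably more than reasoning models.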

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Security tool developers may achieve stronger results by fine-tuning smaller reasoning models on deobfuscation data instead of scaling general models.
  • The benchmark offers a reusable testbed for measuring future progress in LLM-assisted reverse engineering.
  • Extending the benchmark to real-world malware binaries could expose gaps not captured by synthetic transformations.

Load-bearing premise

The selected obfuscation transformations and automatic metrics for pseudocode quality sufficiently represent real-world reverse-engineering difficulty and success criteria.
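If the benchmark blends metrics the way its Figure 3 ("Grid search for optimal α") suggests, the combination could be a weighted sum tuned against some validation signal. The sketch below assumes a blend of lexical and semantic scores tuned to correlate with hypothetical human ratings; every name and the tuning objective are assumptions, not the paper's definition.

```python
def pearson(xs, ys):
    # Plain Pearson correlation; returns 0.0 for a constant series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy)


def combined_score(lexical, semantic, alpha):
    # Weighted blend of two automatic metrics, both assumed in [0, 1].
    return alpha * lexical + (1 - alpha) * semantic


def best_alpha(lexical_scores, semantic_scores, human_ratings, steps=101):
    # Grid search: pick the alpha whose blended score correlates best
    # with (hypothetical) human judgments on a validation subset.
    best, best_r = 0.0, float("-inf")
    for i in range(steps):
        a = i / (steps - 1)
        blended = [combined_score(l, s, a)
                   for l, s in zip(lexical_scores, semantic_scores)]
        r = pearson(blended, human_ratings)
        if r > best_r:
            best, best_r = a, r
    return best
```

The load-bearing question is whether any such automatic objective tracks what human reverse engineers actually care about.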

What would settle it

A controlled test in which a much larger general-purpose model without task-specific fine-tuning achieves higher pseudocode similarity scores than fine-tuned reasoning models on the same obfuscated binaries would challenge the central finding.

Figures

Figures reproduced from arXiv: 2604.08083 by David Lo, Gangyang Li, Jieke Shi, Junqi Zhang, Li Hu, Shaoyin Cheng, Weiming Zhang, Xiuwei Shang, Zhou Yang.

Figure 1. Example of a bubble sort function in three representations: (top-left) source code, (bottom-left) pseudocode … view at source ↗
Figure 2. Workflow of BinDeObfBench. view at source ↗
Figure 3. Grid search for optimal α. view at source ↗
Figure 4. Performance under Different Obfuscation Levels. view at source ↗
Figure 5. Performance across Different Architectures and Optimizations. view at source ↗
Figure 6. Performance of LLMs and Non-LLM Deobfuscation Methods on Malware Dataset. view at source ↗
Figure 7. Deobfuscation Performance with Respect to Pseudocode Length and Symbolic Information. view at source ↗
Original abstract

Deobfuscating binary code remains a fundamental challenge in reverse engineering, as obfuscation is widely used to hinder analysis and conceal program logic. Although large language models (LLMs) have shown promise in recovering semantics from obfuscated binaries, a systematic evaluation of their effectiveness is still lacking. In this work, we present BinDeObfBench, the first comprehensive benchmark for assessing LLM-based binary deobfuscation across diverse transformations spanning pre-compilation, compile-time, and post-compilation stages. Our evaluation shows that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, and that task-specific supervised fine-tuning consistently outperforms broad domain pre-training. Reasoning models can maintain robustness under severe obfuscation, generalize across different instruction set architectures (ISAs) and optimization levels. In-context learning benefits standard models but yields limited gains for reasoning models. Overall, our study highlights the importance of task-specific fine-tuning and reasoning-driven strategies, and positions BinDeObfBench as a basis for future work in binary deobfuscation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BinDeObfBench, the first comprehensive benchmark for evaluating LLMs on deobfuscating binary code to pseudocode across pre-compilation, compile-time, and post-compilation obfuscation transformations. It reports that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, that task-specific supervised fine-tuning (SFT) outperforms broad domain pre-training, and that reasoning models maintain robustness under severe obfuscation while generalizing across ISAs and optimization levels. In-context learning helps standard models but adds limited value for reasoning models.

Significance. If the results hold, the work is significant for establishing the first systematic benchmark in LLM-assisted binary deobfuscation, an area of practical importance in reverse engineering and security. The creation of BinDeObfBench itself provides a reusable resource for future studies. The empirical findings on reasoning vs. scale and the superiority of task-specific SFT offer actionable guidance for model selection and training strategies in this domain.

major comments (2)
  1. The evaluation relies on automatic metrics for pseudocode quality (e.g., similarity or semantic equivalence scores on BinDeObfBench) without any reported validation against human reverse-engineering judgments, such as accuracy in recovering control flow, identifying vulnerabilities, or time-to-understanding. This is load-bearing for the central claims about reasoning models, SFT advantages, and robustness, as superficial syntactic matches could inflate scores without reflecting practical utility.
  2. The generalization claims across ISAs and optimization levels (reported in the evaluation) require explicit details on data splits, potential leakage between training and test sets, and statistical tests for significance; without these, the reported robustness under severe obfuscation cannot be fully assessed for overfitting or selection bias.
minor comments (2)
  1. The abstract and introduction would benefit from a brief table summarizing the obfuscation transformations included in BinDeObfBench and the exact LLMs evaluated.
  2. Notation for the automatic metrics (e.g., how semantic equivalence is computed) should be defined more precisely in the methods section to allow replication.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and will incorporate revisions to enhance the transparency and validation of our evaluation methodology.

read point-by-point responses
  1. Referee: The evaluation relies on automatic metrics for pseudocode quality (e.g., similarity or semantic equivalence scores on BinDeObfBench) without any reported validation against human reverse-engineering judgments, such as accuracy in recovering control flow, identifying vulnerabilities, or time-to-understanding. This is load-bearing for the central claims about reasoning models, SFT advantages, and robustness, as superficial syntactic matches could inflate scores without reflecting practical utility.

    Authors: We agree that direct validation against human reverse-engineering judgments would strengthen the practical relevance of our claims. Our benchmark relies on established automatic metrics for semantic equivalence and similarity, which are widely used in code generation and decompilation literature. However, we acknowledge the potential for these metrics to overstate utility if they do not align with human assessments of control flow recovery or vulnerability identification. In the revised manuscript, we will add a dedicated limitations subsection discussing this gap and include results from a small-scale human evaluation study (involving 3-5 expert reverse engineers) on a stratified subset of BinDeObfBench. We will report correlations between automatic scores and human ratings for key dimensions such as understandability and control-flow accuracy to better support the advantages of reasoning models and task-specific SFT. revision: yes

  2. Referee: The generalization claims across ISAs and optimization levels (reported in the evaluation) require explicit details on data splits, potential leakage between training and test sets, and statistical tests for significance; without these, the reported robustness under severe obfuscation cannot be fully assessed for overfitting or selection bias.

    Authors: We appreciate this request for greater methodological transparency. The current manuscript describes the benchmark construction and evaluation protocol at a high level, but we agree that explicit details on splits, leakage prevention, and significance testing are needed to fully substantiate the generalization and robustness claims. In the revised version, we will expand the 'Dataset Construction' and 'Experimental Setup' sections to detail: (1) the train/test split methodology ensuring no program or obfuscation configuration overlap (via unique source identifiers and hash-based deduplication); (2) verification steps confirming absence of leakage; and (3) statistical significance tests (e.g., paired Wilcoxon signed-rank tests with Bonferroni correction) applied to performance differences across ISAs and optimization levels. These additions will enable readers to rigorously evaluate potential overfitting or selection bias. revision: yes
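The hash-based deduplication the rebuttal proposes can be sketched with the standard library alone. `leakage_free_split` and its whitespace normalization are hypothetical illustrations, not the paper's actual protocol.

```python
import hashlib


def normalize(src: str) -> str:
    # Collapse whitespace so formatting-only variants of the same
    # source hash identically.
    return " ".join(src.split())


def leakage_free_split(programs, test_fraction=0.2):
    # Deterministic hash-based assignment: every obfuscated variant of a
    # program follows its source digest, so no source program can appear
    # on both sides of the split.
    train, test = [], []
    for prog in programs:
        digest = hashlib.sha256(normalize(prog["source"]).encode()).digest()
        bucket = digest[0] / 255.0  # pseudo-uniform in [0, 1]
        (test if bucket < test_fraction else train).append(prog)
    return train, test
```

Statistical significance over such splits would then come from paired tests (the rebuttal names Wilcoxon signed-rank with Bonferroni correction) applied per ISA and optimization level.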

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark study

full rationale

This is an empirical benchmark paper introducing BinDeObfBench and reporting LLM evaluation results across obfuscation transformations, ISAs, and optimization levels. No mathematical derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Central claims rest on experimental scores from the defined benchmark rather than any self-referential fitting, renaming, or self-citation load-bearing step. The study is self-contained against its own externally specified transformations and automatic metrics; no patterns from the enumerated circularity kinds are exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claims rest on standard machine-learning evaluation practices and the assumption that the constructed benchmark faithfully represents the deobfuscation problem.

invented entities (1)
  • BinDeObfBench no independent evidence
    purpose: Comprehensive benchmark dataset and evaluation suite for LLM-based binary-to-pseudocode deobfuscation
    Newly constructed for this study to enable systematic testing across obfuscation stages.

pith-pipeline@v0.9.0 · 5515 in / 1021 out tokens · 37498 ms · 2026-05-10T17:59:27.340060+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

81 extracted references · 21 canonical work pages · 8 internal anchors

  1. [1]

    The obfuscation executive

    Kelly Heffner and Christian Collberg. The obfuscation executive. In International Conference on Information Security, pages 428–440. Springer, 2004

  2. [2]

    Protecting software through obfuscation: Can it keep pace with progress in code analysis?

    Sebastian Schrittwieser, Stefan Katzenbeisser, Johannes Kinder, Georg Merzdovnik, and Edgar Weippl. Protecting software through obfuscation: Can it keep pace with progress in code analysis? ACM Computing Surveys (CSUR), 49(1):1–37, 2016

  3. [3]

    Chosen-instruction attack against commercial code virtualization obfuscators

    Shijia Li, Chunfu Jia, Pengda Qiu, Qiyuan Chen, Jiang Ming, and Debin Gao. Chosen-instruction attack against commercial code virtualization obfuscators. In Proceedings of the 29th Network and Distributed System Security Symposium, 2022

  4. [4]

    Coat: Code obfuscation tool to evaluate the performance of code plagiarism detection tools

    Sangjun Ko, Jusop Choi, and Hyoungshick Kim. Coat: Code obfuscation tool to evaluate the performance of code plagiarism detection tools. In 2017 International Conference on Software Security and Assurance (ICSSA), pages 32–37. IEEE, 2017

  5. [5]

    Malware obfuscation techniques: A brief survey

    Ilsun You and Kangbin Yim. Malware obfuscation techniques: A brief survey. In 2010 International Conference on Broadband, Wireless Computing, Communication and Applications, pages 297–300. IEEE, 2010

  6. [6]

    Binaryai: Binary software composition analysis via intelligent binary source code matching

    Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, and Yuqun Zhang. Binaryai: Binary software composition analysis via intelligent binary source code matching. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024

  7. [7]

    Poster: E-graphs and equality saturation for term-rewriting in mba deobfuscation: An empirical study

    Seoksu Lee, Hyeongchang Jeon, and Eun-Sun Cho. Poster: E-graphs and equality saturation for term-rewriting in mba deobfuscation: An empirical study. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pages 4985–4987, 2024

  8. [8]

    Simplifying mixed boolean-arithmetic obfuscation by program synthesis and term rewriting

    Jaehyung Lee and Woosuk Lee. Simplifying mixed boolean-arithmetic obfuscation by program synthesis and term rewriting. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 2351–2365, 2023

  9. [9]

    A generic approach to automatic deobfuscation of executable code

    Babak Yadegari, Brian Johannesmeyer, Ben Whitely, and Saumya Debray. A generic approach to automatic deobfuscation of executable code. In 2015 IEEE Symposium on Security and Privacy, pages 674–691. IEEE, 2015

  10. [10]

    Sok: Automatic deobfuscation of virtualization-protected applications

    Patrick Kochberger, Sebastian Schrittwieser, Stefan Schweighofer, Peter Kieseberg, and Edgar Weippl. Sok: Automatic deobfuscation of virtualization-protected applications. In Proceedings of the 16th International Conference on Availability, Reliability and Security, pages 1–15, 2021

  11. [11]

    Control-flow deobfuscation using trace-informed compositional program synthesis

    Benjamin Mariano, Ziteng Wang, Shankara Pailoor, Christian Collberg, and Işıl Dillig. Control-flow deobfuscation using trace-informed compositional program synthesis. Proceedings of the ACM on Programming Languages, 8(OOPSLA2):2211–2241, 2024

  12. [12]

    MBA-Blast: Unveiling and simplifying mixed Boolean-Arithmetic obfuscation

    Binbin Liu, Junfu Shen, Jiang Ming, Qilong Zheng, Jing Li, and Dongpeng Xu. MBA-Blast: Unveiling and simplifying mixed Boolean-Arithmetic obfuscation. In 30th USENIX Security Symposium (USENIX Security 21), pages 1701–1718, 2021

  13. [13]

    Llm-based test-driven interactive code generation: User study and empirical evaluation

    Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. Llm-based test-driven interactive code generation: User study and empirical evaluation. IEEE Transactions on Software Engineering, 2024

  14. [14]

    Automated program repair in the era of large pre-trained language models

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023

  15. [15]

    Llm-cloud complete: Leveraging cloud computing for efficient large language model-based code completion

    Mingxuan Zhang, Bo Yuan, Hanzhe Li, and Kangming Xu. Llm-cloud complete: Leveraging cloud computing for efficient large language model-based code completion. Journal of Artificial Intelligence General Science (JAIGS), ISSN: 3006-4023, 5(1):295–326, 2024

  16. [16]

    Degpt: Optimizing decompiler output with llm

    Peiwei Hu, Ruigang Liang, and Kai Chen. Degpt: Optimizing decompiler output with llm. In Proceedings of the 2024 Network and Distributed System Security Symposium, 2024

  17. [17]

    How far have we gone in binary code understanding using large language models

    Xiuwei Shang, Shaoyin Cheng, Guoqiang Chen, Yanming Zhang, Li Hu, Xiao Yu, Gangyang Li, Weiming Zhang, and Nenghai Yu. How far have we gone in binary code understanding using large language models. In 2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 1–12. IEEE, 2024

  18. [18]

    Binmetric: A comprehensive binary analysis benchmark for large language models

    Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, and Nenghai Yu. Binmetric: A comprehensive binary analysis benchmark for large language models. arXiv preprint arXiv:2505.07360, 2025

  19. [19]

    Llm4decompile: Decompiling binary code with large language models

    Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. Llm4decompile: Decompiling binary code with large language models. arXiv preprint arXiv:2403.05286, 2024

  20. [20]

    Beyond classification: Inferring function names in stripped binaries via domain adapted llms

    Linxi Jiang, Xin Jin, and Zhiqiang Lin. Beyond classification: Inferring function names in stripped binaries via domain adapted llms. In Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025

  21. [21]

    Misum: Multi-modality heterogeneous code graph learning for multi-intent binary code summarization

    Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao. Misum: Multi-modality heterogeneous code graph learning for multi-intent binary code summarization. Proceedings of the ACM on Software Engineering, 2(FSE):1339–1362, 2025

  22. [22]

    Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models

    Xin Jin, Jonathan Larson, Weiwei Yang, and Zhiqiang Lin. Binary code summarization: Benchmarking ChatGPT/GPT-4 and other large language models. arXiv preprint arXiv:2312.09601, 2023

  23. [23]

    Typeforge: Synthesizing and selecting best-fit composite data types for stripped binaries

    Yanzhong Wang, Ruigang Liang, Yilin Li, Peiwei Hu, Kai Chen, and Bolun Zhang. Typeforge: Synthesizing and selecting best-fit composite data types for stripped binaries. In 2025 IEEE Symposium on Security and Privacy (SP), pages 1–18. IEEE, 2025

  24. [24]

    Qwen2.5-Coder technical report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  25. [25]

    Qwen3 technical report, 2025

    Qwen Team. Qwen3 technical report, 2025

  26. [26]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

  27. [27]

    Introducing Llama 3.1: Our most capable models to date

    Meta AI. Introducing Llama 3.1: Our most capable models to date. https://ai.meta.com/blog/meta-llama-3-1/, 2025

  28. [28]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024

  29. [29]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/, 2024

  30. [30]

    OpenAI. https://openai.com/o1/, 2024

  31. [31]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  32. [32]

    Recopilot: Reverse engineering copilot in binary analysis

    Guoqiang Chen, Huiqi Sun, Daguang Liu, Zhiqi Wang, Qiang Wang, Bin Yin, Lu Liu, and Lingyun Ying. Recopilot: Reverse engineering copilot in binary analysis. arXiv preprint arXiv:2505.16366, 2025

  33. [33]

    Chatdeob: An effective deobfuscation method based on large language model

    Byunggeon Choi, Hongjoo Jin, Dong Hoon Lee, and Wonsuk Choi. Chatdeob: An effective deobfuscation method based on large language model. In International Conference on Information Security Applications, pages 151–163. Springer, 2024

  34. [34]

    D810. https://github.com/joydo/d810

    joydo. D810. https://github.com/joydo/d810

  35. [35]

    Goomba. https://hex-rays.com/blog/deobfuscation-with-goomba

    Hex-Rays. Goomba. https://hex-rays.com/blog/deobfuscation-with-goomba

  36. [36]

    Analysis of obfuscation transformations on binary code

    Matthieu Tofighi Shirazi. Analysis of obfuscation transformations on binary code. PhD thesis, Université Grenoble Alpes, 2019

  37. [37]

    Defeating opaque predicates statically through machine learning and binary analysis

    Ramtine Tofighi-Shirazi, Irina-Mariuca Asavoae, Philippe Elbaz-Vincent, and Thanh-Ha Le. Defeating opaque predicates statically through machine learning and binary analysis. In Proceedings of the 3rd ACM Workshop on Software Protection, pages 3–14, 2019

  38. [38]

    X-mba: Towards heterogeneous mixed boolean-arithmetic deobfuscation

    Gengwang Li, Min Yu, Dongliang Fang, Gang Li, Xiang Meng, Jiangguo Jiang, and Weiqing Huang. X-mba: Towards heterogeneous mixed boolean-arithmetic deobfuscation. In MILCOM 2024 - 2024 IEEE Military Communications Conference (MILCOM), pages 1082–1087. IEEE, 2024

  39. [39]

    Dose: Deobfuscation based on semantic equivalence

    Ramtine Tofighi-Shirazi, Maria Christofi, Philippe Elbaz-Vincent, and Thanh-Ha Le. Dose: Deobfuscation based on semantic equivalence. In Proceedings of the 8th Software Security, Protection, and Reverse Engineering Workshop, pages 1–12, 2018

  40. [40]

    Search-based local black-box deobfuscation: understand, improve and mitigate

    Grégoire Menguy, Sébastien Bardin, Richard Bonichon, and Cauim de Souza Lima. Search-based local black-box deobfuscation: understand, improve and mitigate. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 2513–2525, 2021

  41. [41]

    Input-output example-guided data deobfuscation on binary

    Yujie Zhao, Zhanyong Tang, Guixin Ye, Xiaoqing Gong, and Dingyi Fang. Input-output example-guided data deobfuscation on binary. Security and Communication Networks, 2021(1):4646048, 2021

  42. [42]

    Qsynth-a program synthesis based approach for binary code deobfuscation

    Robin David, Luigi Coniglio, Mariano Ceccato, et al. Qsynth-a program synthesis based approach for binary code deobfuscation. In BAR 2020 Workshop, 2020

  43. [43]

    Syntia: Synthesizing the semantics of obfuscated code

    Tim Blazytko, Moritz Contag, Cornelius Aschermann, and Thorsten Holz. Syntia: Synthesizing the semantics of obfuscated code. In 26th USENIX Security Symposium (USENIX Security 17), pages 643–659, 2017

  44. [44]

    Seead: A semantic-based approach for automatic binary code de-obfuscation

    Zhanyong Tang, Kaiyuan Kuang, Lei Wang, Chao Xue, Xiaoqing Gong, Xiaojiang Chen, Dingyi Fang, Jie Liu, and Zheng Wang. Seead: A semantic-based approach for automatic binary code de-obfuscation. In 2017 IEEE Trustcom/BigDataSE/ICESS, pages 261–268. IEEE, 2017

  45. [45]

    Exploring the potential of llms for code deobfuscation

    David Beste, Grégoire Menguy, Hossein Hajipour, Mario Fritz, Antonio Emanuele Cinà, Sébastien Bardin, Thorsten Holz, Thorsten Eisenhofer, and Lea Schönherr. Exploring the potential of llms for code deobfuscation. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 267–286. Springer, 2025

  46. [46]

    Dobf: A deobfuscation pre-training objective for programming languages

    Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, and Guillaume Lample. Dobf: A deobfuscation pre-training objective for programming languages. Advances in Neural Information Processing Systems, 34:14967–14979, 2021

  47. [47]

    Alfredo: Agentic llm-based framework for code deobfuscation

    Ching Yuhui Natalie, Sophie Tung Xuan Ying, and Siow Jing Kai. Alfredo: Agentic llm-based framework for code deobfuscation

  48. [48]

    Can llms obfuscate code? a systematic analysis of large language models into assembly code obfuscation

    Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ketu Ndawula, Sriram Vema, Edward Raff, and Manas Gaur. Can llms obfuscate code? a systematic analysis of large language models into assembly code obfuscation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24893–24901, 2025

  49. [49]

    Deconstructing obfuscation: A four-dimensional framework for evaluating large language models assembly code deobfuscation capabilities

    Anton Tkachenko, Dmitrij Suskevic, and Benjamin Adolphi. Deconstructing obfuscation: A four-dimensional framework for evaluating large language models assembly code deobfuscation capabilities. arXiv preprint arXiv:2505.19887, 2025

  50. [50]

    Enabling obfuscation detection in binary software through explainable ai

    Claudia Greco, Michele Ianni, Antonella Guzzo, and Giancarlo Fortino. Enabling obfuscation detection in binary software through explainable ai. IEEE Transactions on Emerging Topics in Computing, 2024

  51. [51]

    Debra: A real-world benchmark for evaluating deobfuscation methods

    Zheyun Feng and Dongpeng Xu. Debra: A real-world benchmark for evaluating deobfuscation methods. In Proceedings of the 2025 Workshop on Software Understanding and Reverse Engineering, pages 76–88, 2025

  52. [52]

    Predicting the resilience of obfuscated code against symbolic execution attacks via machine learning

    B Sebastian, C Christian, and P Alexander. Predicting the resilience of obfuscated code against symbolic execution attacks via machine learning. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, pages 16–18, 2017

  53. [53]

    Mibench: A free, commercially representative embedded benchmark suite

    Matthew R Guthaus, Jeffrey S Ringenberg, Dan Ernst, Todd M Austin, Trevor Mudge, and Richard B Brown. Mibench: A free, commercially representative embedded benchmark suite. In Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No. 01EX538), pages 3–14. IEEE, 2001

  54. [54]

    Spec cpu2006 benchmark descriptions

    John L Henning. Spec cpu2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006

  55. [55]

    An empirical study on the effectiveness of large language models for binary code understanding

    Xiuwei Shang, Zhenkan Fu, Shaoyin Cheng, Guoqiang Chen, Gangyang Li, Li Hu, Weiming Zhang, and Nenghai Yu. An empirical study on the effectiveness of large language models for binary code understanding. arXiv preprint arXiv:2504.21803, 2025

  56. [56]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  57. [57]

    Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks

    Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655, 2021

  58. [58]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  59. [59]

    Evalplus leaderboard. https://evalplus.github.io/leaderboard.html, 2024

  60. [60]

    Codexglue: A machine learning benchmark dataset for code understanding and generation

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664, 2021

  61. [61]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019

  62. [62]

    Ollvm. https://github.com/obfuscator-llvm/obfuscator

    obfuscator-llvm. Ollvm. https://github.com/obfuscator-llvm/obfuscator

  63. [63]

    Hikari. https://github.com/HikariObfuscator/Hikari

    HikariObfuscator. Hikari. https://github.com/HikariObfuscator/Hikari

  64. [64]

    Tigress. https://tigress.wtf

    University of Arizona. Tigress. https://tigress.wtf

  65. [65]

    Alcatraz. https://github.com/weak1337/Alcatraz

    weak1337. Alcatraz. https://github.com/weak1337/Alcatraz

  66. [66]

    Loop: Logic-oriented opaque predicate detection in obfuscated binary code

    Jiang Ming, Dongpeng Xu, Li Wang, and Dinghao Wu. Loop: Logic-oriented opaque predicate detection in obfuscated binary code. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 757–768, 2015

  67. [67]

    Measuring nominal scale agreement among many raters

    Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5):378, 1971

  68. [68]

    thezoo. https://github.com/ytisf/theZoo

    ytisf. thezoo. https://github.com/ytisf/theZoo

  69. [69]

    Malwaresourcecode. https://github.com/vxunderground/MalwareSourceCode

    vxunderground. Malwaresourcecode. https://github.com/vxunderground/MalwareSourceCode

  70. [70]

    A survey of large language models for code: Evolution, benchmarking, and future trends

    Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372, 2023

  71. [71]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  72. [72]

    OpenAI. https://platform.openai.com/docs/models/gpt-3.5-turbo, 2023

  73. [73]

    Llm2vec: Large language models are secretly powerful text encoders

    Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders. arXiv preprint arXiv:2404.05961, 2024

  74. [74]

    When text embedding meets large language model: a comprehensive survey

    Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, and Richong Zhang. When text embedding meets large language model: a comprehensive survey. arXiv preprint arXiv:2412.09165, 2024

  75. [75]

    Codexembed: A generalist embedding model family for multilingual and multi-task code retrieval

    Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Codexembed: A generalist embedding model family for multilingual and multi-task code retrieval. arXiv preprint arXiv:2411.12644, 2024

  76. [76]

    Efficient code embeddings from code generation models

    Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, and Han Xiao. Efficient code embeddings from code generation models. arXiv preprint arXiv:2508.21290, 2025

  77. [77]

    Qwen2 technical report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  78. [78]

    Mining of massive data sets

    Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive data sets. Cambridge University Press, 2020

  79. [79]

    The dimensionality of program complexity

    John C Munson and Taghi M Khoshgoftaar. The dimensionality of program complexity. In Proceedings of the 11th international conference on Software engineering, pages 245–253, 1989

  80. [80]

    Software complexity analysis using halstead metrics

    T Hariprasad, G Vidhyagaran, K Seenu, and Chandrasegar Thirumalai. Software complexity analysis using halstead metrics. In 2017 international conference on trends in electronics and informatics (ICEI), pages 1109–1113. IEEE, 2017

Showing first 80 references.