Recognition: 2 theorem links
Can LLMs Deobfuscate Binary Code? A Systematic Analysis of Large Language Models into Pseudocode Deobfuscation
Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3
The pith
LLM deobfuscation of binaries succeeds through reasoning ability and task-specific fine-tuning rather than model scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, that task-specific supervised fine-tuning consistently outperforms broad domain pre-training, and that reasoning models maintain robustness under severe obfuscation while generalizing across instruction set architectures and optimization levels. In-context learning benefits standard models but yields limited gains for reasoning models.
What carries the argument
BinDeObfBench, a benchmark spanning pre-compilation, compile-time, and post-compilation obfuscation transformations to measure pseudocode recovery quality from binaries.
If this is right
- Reasoning models maintain high performance even under severe obfuscation.
- Models generalize across different instruction set architectures and optimization levels.
- Task-specific supervised fine-tuning produces better results than broad pre-training alone.
- In-context learning improves standard models more than reasoning models.
Where Pith is reading between the lines
- Security tool developers may achieve stronger results by fine-tuning smaller reasoning models on deobfuscation data instead of scaling general models.
- The benchmark offers a reusable testbed for measuring future progress in LLM-assisted reverse engineering.
- Extending the benchmark to real-world malware binaries could expose gaps not captured by synthetic transformations.
Load-bearing premise
The selected obfuscation transformations and automatic metrics for pseudocode quality sufficiently represent real-world reverse-engineering difficulty and success criteria.
What would settle it
A controlled test in which a much larger general-purpose model without task-specific fine-tuning achieves higher pseudocode similarity scores than fine-tuned reasoning models on the same obfuscated binaries would challenge the central finding.
Original abstract
Deobfuscating binary code remains a fundamental challenge in reverse engineering, as obfuscation is widely used to hinder analysis and conceal program logic. Although large language models (LLMs) have shown promise in recovering semantics from obfuscated binaries, a systematic evaluation of their effectiveness is still lacking. In this work, we present BinDeObfBench, the first comprehensive benchmark for assessing LLM-based binary deobfuscation across diverse transformations spanning pre-compilation, compile-time, and post-compilation stages. Our evaluation shows that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, and that task-specific supervised fine-tuning consistently outperforms broad domain pre-training. Reasoning models can maintain robustness under severe obfuscation, generalize across different instruction set architectures (ISAs) and optimization levels. In-context learning benefits standard models but yields limited gains for reasoning models. Overall, our study highlights the importance of task-specific fine-tuning and reasoning-driven strategies, and positions BinDeObfBench as a basis for future work in binary deobfuscation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BinDeObfBench, the first comprehensive benchmark for evaluating LLMs on deobfuscating binary code to pseudocode across pre-compilation, compile-time, and post-compilation obfuscation transformations. It reports that deobfuscation performance depends more on reasoning capability and domain expertise than on model scale, that task-specific supervised fine-tuning (SFT) outperforms broad domain pre-training, and that reasoning models maintain robustness under severe obfuscation while generalizing across ISAs and optimization levels. In-context learning helps standard models but adds limited value for reasoning models.
Significance. If the results hold, the work is significant for establishing the first systematic benchmark in LLM-assisted binary deobfuscation, an area of practical importance in reverse engineering and security. The creation of BinDeObfBench itself provides a reusable resource for future studies. The empirical findings on reasoning vs. scale and the superiority of task-specific SFT offer actionable guidance for model selection and training strategies in this domain.
major comments (2)
- The evaluation relies on automatic metrics for pseudocode quality (e.g., similarity or semantic equivalence scores on BinDeObfBench) without any reported validation against human reverse-engineering judgments, such as accuracy in recovering control flow, identifying vulnerabilities, or time-to-understanding. This is load-bearing for the central claims about reasoning models, SFT advantages, and robustness, as superficial syntactic matches could inflate scores without reflecting practical utility.
- The generalization claims across ISAs and optimization levels (reported in the evaluation) require explicit details on data splits, potential leakage between training and test sets, and statistical tests for significance; without these, the reported robustness under severe obfuscation cannot be fully assessed for overfitting or selection bias.
minor comments (2)
- The abstract and introduction would benefit from a brief table summarizing the obfuscation transformations included in BinDeObfBench and the exact LLMs evaluated.
- Notation for the automatic metrics (e.g., how semantic equivalence is computed) should be defined more precisely in the methods section to allow replication.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major point below and will incorporate revisions to enhance the transparency and validation of our evaluation methodology.
Point-by-point responses
- Referee: The evaluation relies on automatic metrics for pseudocode quality (e.g., similarity or semantic equivalence scores on BinDeObfBench) without any reported validation against human reverse-engineering judgments, such as accuracy in recovering control flow, identifying vulnerabilities, or time-to-understanding. This is load-bearing for the central claims about reasoning models, SFT advantages, and robustness, as superficial syntactic matches could inflate scores without reflecting practical utility.
Authors: We agree that direct validation against human reverse-engineering judgments would strengthen the practical relevance of our claims. Our benchmark relies on established automatic metrics for semantic equivalence and similarity, which are widely used in code generation and decompilation literature. However, we acknowledge the potential for these metrics to overstate utility if they do not align with human assessments of control flow recovery or vulnerability identification. In the revised manuscript, we will add a dedicated limitations subsection discussing this gap and include results from a small-scale human evaluation study (involving 3-5 expert reverse engineers) on a stratified subset of BinDeObfBench. We will report correlations between automatic scores and human ratings for key dimensions such as understandability and control-flow accuracy to better support the advantages of reasoning models and task-specific SFT. revision: yes
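The validation the authors promise, correlating automatic metric scores with expert ratings, could be sketched with a rank correlation. The data below are invented for illustration (`auto_scores` and `human_ratings` are hypothetical per-sample values, not results from the paper):

```python
from scipy.stats import spearmanr

# Hypothetical automatic metric scores (0-1) and expert ratings (1-5)
# for the same set of deobfuscated functions.
auto_scores = [0.82, 0.45, 0.91, 0.30, 0.67, 0.55, 0.78, 0.40]
human_ratings = [4, 2, 5, 1, 4, 3, 4, 2]

# Spearman's rho measures monotonic agreement between the two rankings,
# which is what matters when the metric and the rating use different scales.
rho, p_value = spearmanr(auto_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.4f}")
```

A high rho with a small p-value would indicate that the automatic metric preserves the ordering experts assign, which is the alignment the referee asks for.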
- Referee: The generalization claims across ISAs and optimization levels (reported in the evaluation) require explicit details on data splits, potential leakage between training and test sets, and statistical tests for significance; without these, the reported robustness under severe obfuscation cannot be fully assessed for overfitting or selection bias.
Authors: We appreciate this request for greater methodological transparency. The current manuscript describes the benchmark construction and evaluation protocol at a high level, but we agree that explicit details on splits, leakage prevention, and significance testing are needed to fully substantiate the generalization and robustness claims. In the revised version, we will expand the 'Dataset Construction' and 'Experimental Setup' sections to detail: (1) the train/test split methodology ensuring no program or obfuscation configuration overlap (via unique source identifiers and hash-based deduplication); (2) verification steps confirming absence of leakage; and (3) statistical significance tests (e.g., paired Wilcoxon signed-rank tests with Bonferroni correction) applied to performance differences across ISAs and optimization levels. These additions will enable readers to rigorously evaluate potential overfitting or selection bias. revision: yes
Circularity Check
No significant circularity in empirical benchmark study
full rationale
This is an empirical benchmark paper introducing BinDeObfBench and reporting LLM evaluation results across obfuscation transformations, ISAs, and optimization levels. No mathematical derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction. Central claims rest on experimental scores from the defined benchmark rather than any self-referential fitting, renaming, or self-citation load-bearing step. The study is self-contained against its own externally specified transformations and automatic metrics; no patterns from the enumerated circularity kinds are exhibited.
Axiom & Free-Parameter Ledger
invented entities (1)
- BinDeObfBench (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
We present BINDEOBFBENCH, the first comprehensive benchmark for assessing LLM-based binary deobfuscation across diverse transformations... deobfuscation performance depends more on reasoning capability and domain expertise than on model scale
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
We design four evaluation metrics... lexical consistency (BLEU), semantic preservation (Dual-Perspective Semantic Fusion), code simplicity (token-wise delta entropy), code readability (Halstead)
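The quoted metric suite ends with a Halstead readability score, a classical complexity measure. As a rough, illustrative sketch (not the paper's implementation; the toy token streams below are assumptions), Halstead volume is computed from operator and operand counts:

```python
import math

def halstead_volume(operators, operands):
    """Halstead volume V = N * log2(n), where N is the total number of
    operator/operand occurrences and n is the vocabulary size
    (distinct operators plus distinct operands)."""
    n = len(set(operators)) + len(set(operands))  # vocabulary
    N = len(operators) + len(operands)            # program length
    return N * math.log2(n)

# Toy token streams for a tiny pseudocode snippet: x = a + b * a
ops = ["=", "+", "*"]
opnds = ["x", "a", "b", "a"]
print(round(halstead_volume(ops, opnds), 2))
```

Lower volume for a deobfuscated function than for its obfuscated counterpart is the direction such a readability metric would be expected to move.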
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
The obfuscation executive
Kelly Heffner and Christian Collberg. The obfuscation executive. InInternational Conference on Information Security, pages 428–440. Springer, 2004
2004
-
[2]
Protecting software through obfuscation: Can it keep pace with progress in code analysis?
Sebastian Schrittwieser, Stefan Katzenbeisser, Johannes Kinder, Georg Merzdovnik, and Edgar Weippl. Protecting software through obfuscation: Can it keep pace with progress in code analysis? ACM Computing Surveys (CSUR), 49(1):1–37, 2016
2016
-
[3]
Chosen-instruction attack against commercial code virtualization obfuscators
Shijia Li, Chunfu Jia, Pengda Qiu, Qiyuan Chen, Jiang Ming, and Debin Gao. Chosen-instruction attack against commercial code virtualization obfuscators. InIn Proceedings of the 29th Network and Distributed System Security Symposium, 2022
2022
-
[4]
Coat: Code obfuscation tool to evaluate the performance of code plagiarism detection tools
Sangjun Ko, Jusop Choi, and Hyoungshick Kim. Coat: Code obfuscation tool to evaluate the performance of code plagiarism detection tools. In2017 International conference on software security and assurance (ICSSA), pages 32–37. IEEE, 2017
2017
-
[5]
Malware obfuscation techniques: A brief survey
Ilsun You and Kangbin Yim. Malware obfuscation techniques: A brief survey. In2010 International conference on broadband, wireless computing, communication and applications, pages 297–300. IEEE, 2010
2010
-
[6]
Binaryai: Binary software composition analysis via intelligent binary source code matching
Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, and Yuqun Zhang. Binaryai: Binary software composition analysis via intelligent binary source code matching. InProceedings of the IEEE/ACM 46th International Conference on Software Engineering, pages 1–13, 2024
2024
-
[7]
Poster: E-graphs and equality saturation for term-rewriting in mba deobfuscation: An empirical study
Seoksu Lee, Hyeongchang Jeon, and Eun-Sun Cho. Poster: E-graphs and equality saturation for term-rewriting in mba deobfuscation: An empirical study. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 4985–4987, 2024
2024
-
[8]
Simplifying mixed boolean-arithmetic obfuscation by program synthesis and term rewriting
Jaehyung Lee and Woosuk Lee. Simplifying mixed boolean-arithmetic obfuscation by program synthesis and term rewriting. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pages 2351–2365, 2023
2023
-
[9]
A generic approach to automatic deobfuscation of executable code
Babak Yadegari, Brian Johannesmeyer, Ben Whitely, and Saumya Debray. A generic approach to automatic deobfuscation of executable code. In2015 IEEE Symposium on Security and Privacy, pages 674–691. IEEE, 2015
2015
-
[10]
Sok: Automatic deobfuscation of virtualization-protected applications
Patrick Kochberger, Sebastian Schrittwieser, Stefan Schweighofer, Peter Kieseberg, and Edgar Weippl. Sok: Automatic deobfuscation of virtualization-protected applications. InProceedings of the 16th International Conference on Availability, Reliability and Security, pages 1–15, 2021
2021
-
[11]
Control-flow deobfuscation using trace-informed compositional program synthesis.Proceedings of the ACM on Programming Languages, 8(OOPSLA2):2211–2241, 2024
Benjamin Mariano, Ziteng Wang, Shankara Pailoor, Christian Collberg, and I¸ sil Dillig. Control-flow deobfuscation using trace-informed compositional program synthesis.Proceedings of the ACM on Programming Languages, 8(OOPSLA2):2211–2241, 2024
2024
-
[12]
{MBA-Blast}: Unveiling and simplifying mixed {Boolean-Arithmetic} obfuscation
Binbin Liu, Junfu Shen, Jiang Ming, Qilong Zheng, Jing Li, and Dongpeng Xu. {MBA-Blast}: Unveiling and simplifying mixed {Boolean-Arithmetic} obfuscation. In 30th USENIX Security Symposium (USENIX Security 21), pages 1701–1718, 2021
2021
-
[13]
Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering, 2024
Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K Lahiri. Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering, 2024
2024
-
[14]
Automated program repair in the era of large pre-trained language models
Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482–1494. IEEE, 2023
2023
-
[15]
Mingxuan Zhang, Bo Yuan, Hanzhe Li, and Kangming Xu. Llm-cloud complete: Leveraging cloud computing for efficient large language model-based code completion.Journal of Artificial Intelligence General science (JAIGS) ISSN: 3006-4023, 5(1):295–326, 2024
2024
-
[16]
Degpt: Optimizing decompiler output with llm
Peiwei Hu, Ruigang Liang, and Kai Chen. Degpt: Optimizing decompiler output with llm. InProceedings 2024 Network and Distributed System Security Symposium, volume 267622140, 2024
2024
-
[17]
How far have we gone in binary code understanding using large language models
Xiuwei Shang, Shaoyin Cheng, Guoqiang Chen, Yanming Zhang, Li Hu, Xiao Yu, Gangyang Li, Weiming Zhang, and Nenghai Yu. How far have we gone in binary code understanding using large language models. In2024 IEEE International Conference on Software Maintenance and Evolution (ICSME), pages 1–12. IEEE, 2024
2024
-
[18]
Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, and Nenghai Yu. Binmetric: A comprehensive binary analysis benchmark for large language models.arXiv preprint arXiv:2505.07360, 2025
-
[19]
Llm4decompile: Decompiling binary code with large language models
Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. Llm4decompile: Decompiling binary code with large language models.arXiv preprint arXiv:2403.05286, 2024
-
[20]
Beyond classification: Inferring function names in stripped binaries via domain adapted llms
Linxi Jiang, Xin Jin, and Zhiqiang Lin. Beyond classification: Inferring function names in stripped binaries via domain adapted llms. InProceedings of the 2025 on ACM SIGSAC Conference on Computer and Communications Security, 2025
2025
-
[21]
Misum: Multi-modality heterogeneous code graph learning for multi-intent binary code summarization
Kangchen Zhu, Zhiliang Tian, Shangwen Wang, Weiguo Chen, Zixuan Dong, Mingyue Leng, and Xiaoguang Mao. Misum: Multi-modality heterogeneous code graph learning for multi-intent binary code summarization. Proceedings of the ACM on Software Engineering, 2(FSE):1339–1362, 2025
2025
-
[22]
Xin Jin, Jonathan Larson, Weiwei Yang, and Zhiqiang Lin. Binary code summarization: Benchmarking chatgpt/gpt- 4 and other large language models.arXiv preprint arXiv:2312.09601, 2023
-
[23]
Typeforge: Synthesizing and selecting best-fit composite data types for stripped binaries
Yanzhong Wang, Ruigang Liang, Yilin Li, Peiwei Hu, Kai Chen, and Bolun Zhang. Typeforge: Synthesizing and selecting best-fit composite data types for stripped binaries. In2025 IEEE Symposium on Security and Privacy (SP), pages 1–18. IEEE, 2025
2025
-
[24]
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024
2024
-
[25]
Qwen3 technical report, 2025
Qwen Team. Qwen3 technical report, 2025
2025
-
[26]
Code Llama: Open Foundation Models for Code
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023
2023
-
[27]
Introducing llama 3.1: Our most capable models to date.https://ai.meta.com/blog/meta-llama-3-1/, 2025
Meta AI. Introducing llama 3.1: Our most capable models to date.https://ai.meta.com/blog/meta-llama-3-1/, 2025
2025
-
[28]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
2024
-
[29]
Hello gpt-4 turbo.https://openai.com/index/hello-gpt-4o/, 2024
OpenAI. Hello gpt-4 turbo.https://openai.com/index/hello-gpt-4o/, 2024
2024
-
[30]
OpenAI. https://openai.com/o1/, 2024
2024
-
[31]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
2025
-
[32]
Recopilot: Reverse engineering copilot in binary analysis.arXiv preprint arXiv:2505.16366, 2025
Guoqiang Chen, Huiqi Sun, Daguang Liu, Zhiqi Wang, Qiang Wang, Bin Yin, Lu Liu, and Lingyun Ying. Recopilot: Reverse engineering copilot in binary analysis.arXiv preprint arXiv:2505.16366, 2025
-
[33]
Chatdeob: An effective deobfuscation method based on large language model
Byunggeon Choi, Hongjoo Jin, Dong Hoon Lee, and Wonsuk Choi. Chatdeob: An effective deobfuscation method based on large language model. InInternational Conference on Information Security Applications, pages 151–163. Springer, 2024
2024
-
[34]
D810.https://github.com/joydo/d810
joydo. D810. https://github.com/joydo/d810
-
[35]
Goomba.https://hex-rays.com/blog/deobfuscation-with-goomba
Hex-Rays. Goomba.https://hex-rays.com/blog/deobfuscation-with-goomba
-
[36]
PhD thesis, Université Grenoble Alpes, 2019
Matthieu Tofighi Shirazi.Analysis of obfuscation transformations on binary code. PhD thesis, Université Grenoble Alpes, 2019
2019
-
[37]
Defeating opaque predicates statically through machine learning and binary analysis
Ramtine Tofighi-Shirazi, Irina-Mariuca Asavoae, Philippe Elbaz-Vincent, and Thanh-Ha Le. Defeating opaque predicates statically through machine learning and binary analysis. InProceedings of the 3rd ACM Workshop on Software Protection, pages 3–14, 2019
2019
-
[38]
X- mba: Towards heterogeneous mixed boolean-arithmetic deobfuscation
Gengwang Li, Min Yu, Dongliang Fang, Gang Li, Xiang Meng, Jiangguo Jiang, and Weiqing Huang. X- mba: Towards heterogeneous mixed boolean-arithmetic deobfuscation. InMILCOM 2024-2024 IEEE Military Communications Conference (MILCOM), pages 1082–1087. IEEE, 2024
2024
-
[39]
Dose: Deobfuscation based on semantic equivalence
Ramtine Tofighi-Shirazi, Maria Christofi, Philippe Elbaz-Vincent, and Thanh-Ha Le. Dose: Deobfuscation based on semantic equivalence. InProceedings of the 8th Software Security, Protection, and Reverse Engineering Workshop, pages 1–12, 2018
2018
-
[40]
Search-based local black-box deobfuscation: understand, improve and mitigate
Grégoire Menguy, Sébastien Bardin, Richard Bonichon, and Cauim de Souza Lima. Search-based local black-box deobfuscation: understand, improve and mitigate. InProceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, pages 2513–2525, 2021
2021
-
[41]
Input-output example-guided data deobfuscation on binary.Security and Communication Networks, 2021(1):4646048, 2021
Yujie Zhao, Zhanyong Tang, Guixin Ye, Xiaoqing Gong, and Dingyi Fang. Input-output example-guided data deobfuscation on binary.Security and Communication Networks, 2021(1):4646048, 2021
2021
-
[42]
Qsynth-a program synthesis based approach for binary code deobfuscation
Robin David, Luigi Coniglio, Mariano Ceccato, et al. Qsynth-a program synthesis based approach for binary code deobfuscation. InBAR 2020 Workshop, 2020
2020
-
[43]
Syntia: Synthesizing the semantics of obfuscated code
Tim Blazytko, Moritz Contag, Cornelius Aschermann, and Thorsten Holz. Syntia: Synthesizing the semantics of obfuscated code. In26th USENIX Security Symposium (USENIX Security 17), pages 643–659, 2017
2017
-
[44]
Seead: A semantic-based approach for automatic binary code de-obfuscation
Zhanyong Tang, Kaiyuan Kuang, Lei Wang, Chao Xue, Xiaoqing Gong, Xiaojiang Chen, Dingyi Fang, Jie Liu, and Zheng Wang. Seead: A semantic-based approach for automatic binary code de-obfuscation. In2017 IEEE Trustcom/BigDataSE/ICESS, pages 261–268. IEEE, 2017
2017
-
[45]
Exploring the potential of llms for code deobfuscation
David Beste, Grégoire Menguy, Hossein Hajipour, Mario Fritz, Antonio Emanuele Cinà, Sébastien Bardin, Thorsten Holz, Thorsten Eisenhofer, and Lea Schönherr. Exploring the potential of llms for code deobfuscation. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 267–286. Springer, 2025
2025
-
[46]
Dobf: A deobfuscation pre-training objective for programming languages.Advances in Neural Information Processing Systems, 34:14967– 14979, 2021
Marie-Anne Lachaux, Baptiste Roziere, Marc Szafraniec, and Guillaume Lample. Dobf: A deobfuscation pre-training objective for programming languages.Advances in Neural Information Processing Systems, 34:14967– 14979, 2021
2021
-
[47]
Alfredo: Agentic llm-based framework for code deobfuscation
Ching Yuhui Natalie, Sophie Tung Xuan Ying, and Siow Jing Kai. Alfredo: Agentic llm-based framework for code deobfuscation
-
[48]
Can llms obfuscate code? a systematic analysis of large language models into assembly code obfuscation
Seyedreza Mohseni, Seyedali Mohammadi, Deepa Tilwani, Yash Saxena, Gerald Ketu Ndawula, Sriram Vema, Edward Raff, and Manas Gaur. Can llms obfuscate code? a systematic analysis of large language models into assembly code obfuscation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24893–24901, 2025
2025
-
[49]
Anton Tkachenko, Dmitrij Suskevic, and Benjamin Adolphi. Deconstructing obfuscation: A four-dimensional framework for evaluating large language models assembly code deobfuscation capabilities.arXiv preprint arXiv:2505.19887, 2025
-
[50]
Enabling obfuscation detection in binary software through explainable ai.IEEE Transactions on Emerging Topics in Computing, 2024
Claudia Greco, Michele Ianni, Antonella Guzzo, and Giancarlo Fortino. Enabling obfuscation detection in binary software through explainable ai.IEEE Transactions on Emerging Topics in Computing, 2024
2024
-
[51]
Debra: A real-world benchmark for evaluating deobfuscation methods
Zheyun Feng and Dongpeng Xu. Debra: A real-world benchmark for evaluating deobfuscation methods. In Proceedings of the 2025 Workshop on Software Understanding and Reverse Engineering, pages 76–88, 2025
2025
-
[52]
Predicting the resilience of obfuscated code against symbolic execution attacks via machine learning
B Sebastian, C Christian, and P Alexander. Predicting the resilience of obfuscated code against symbolic execution attacks via machine learning. InProceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, pages 16–18, 2017
2017
-
[53]
Mibench: A free, commercially representative embedded benchmark suite
Matthew R Guthaus, Jeffrey S Ringenberg, Dan Ernst, Todd M Austin, Trevor Mudge, and Richard B Brown. Mibench: A free, commercially representative embedded benchmark suite. InProceedings of the fourth annual IEEE international workshop on workload characterization. WWC-4 (Cat. No. 01EX538), pages 3–14. IEEE, 2001
2001
-
[54]
Spec cpu2006 benchmark descriptions.ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006
John L Henning. Spec cpu2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 34(4):1–17, 2006
2006
-
[55]
Xiuwei Shang, Zhenkan Fu, Shaoyin Cheng, Guoqiang Chen, Gangyang Li, Li Hu, Weiming Zhang, and Nenghai Yu. An empirical study on the effectiveness of large language models for binary code understanding.arXiv preprint arXiv:2504.21803, 2025
-
[56]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
2002
-
[57]
Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks
Ruchir Puri, David S Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, et al. Codenet: A large-scale ai for code dataset for learning a diversity of coding tasks.arXiv preprint arXiv:2105.12655, 2021
-
[58]
Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee- lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020
2020
-
[59]
Evalplus leaderboard.https://evalplus.github.io/leaderboard.html, 2024
2024
-
[60]
Codexglue: A machine learning benchmark dataset for code understanding and generation
Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. Codexglue: A machine learning benchmark dataset for code understanding and generation.arXiv preprint arXiv:2102.04664, 2021
-
[61]
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019
2019
-
[62]
Ollvm.https://github.com/obfuscator-llvm/obfuscator
obfuscator llvm. Ollvm.https://github.com/obfuscator-llvm/obfuscator
-
[63]
Hikari.https://github.com/HikariObfuscator/Hikari
HikariObfuscator. Hikari.https://github.com/HikariObfuscator/Hikari
-
[64]
Tigress.https://tigress.wtf
Arizona. Tigress.https://tigress.wtf
-
[65]
Alcatraz.https://github.com/weak1337/Alcatraz
weak1337. Alcatraz.https://github.com/weak1337/Alcatraz
-
[66]
Loop: Logic-oriented opaque predicate detection in obfuscated binary code
Jiang Ming, Dongpeng Xu, Li Wang, and Dinghao Wu. Loop: Logic-oriented opaque predicate detection in obfuscated binary code. InProceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 757–768, 2015
2015
-
[67]
Measuring nominal scale agreement among many raters.Psychological bulletin, 76(5):378, 1971
Joseph L Fleiss. Measuring nominal scale agreement among many raters.Psychological bulletin, 76(5):378, 1971
1971
-
[68]
thezoo.https://github.com/ytisf/theZoo
ytisf. thezoo.https://github.com/ytisf/theZoo
-
[69]
Malwaresourcecode.https://github.com/vxunderground/MalwareSourceCode
vxunderground. Malwaresourcecode.https://github.com/vxunderground/MalwareSourceCode
-
[70]
A survey of large language models for code: Evolution, benchmarking, and future trends
Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. A survey of large language models for code: Evolution, benchmarking, and future trends.arXiv preprint arXiv:2311.10372, 2023
-
[71]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models.arXiv preprint arXiv:2203.11171, 2022
2022
-
[72]
OpenAI.https://platform.openai.com/docs/models/gpt-3.5-turbo, 2023
2023
-
[73]
Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders.arXiv preprint arXiv:2404.05961, 2024
-
[74]
When text embedding meets large language model: a comprehensive survey
Zhijie Nie, Zhangchi Feng, Mingxin Li, Cunwang Zhang, Yanzhao Zhang, Dingkun Long, and Richong Zhang. When text embedding meets large language model: a comprehensive survey.arXiv preprint arXiv:2412.09165, 2024
-
[75]
Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Codexembed: A generalist embedding model family for multiligual and multi-task code retrieval.arXiv preprint arXiv:2411.12644, 2024
-
[76]
Efficient code embeddings from code generation models.arXiv preprint arXiv:2508.21290, 2025
Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, and Han Xiao. Efficient code embeddings from code generation models.arXiv preprint arXiv:2508.21290, 2025
-
[77]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng ...
2024
-
[78]
Cambridge university press, 2020
Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman.Mining of massive data sets. Cambridge university press, 2020
2020
-
[79]
The dimensionality of program complexity
John C Munson and Taghi M Khoshgoftaar. The dimensionality of program complexity. InProceedings of the 11th international conference on Software engineering, pages 245–253, 1989
1989
-
[80]
Software complexity analysis using halstead metrics
T Hariprasad, G Vidhyagaran, K Seenu, and Chandrasegar Thirumalai. Software complexity analysis using halstead metrics. In2017 international conference on trends in electronics and informatics (ICEI), pages 1109–1113. IEEE, 2017
2017