pith. sign in

arxiv: 2606.29155 · v1 · pith:PDWOZ3YXnew · submitted 2026-06-28 · 💻 cs.SE

OASIF: An Efficient Obfuscation-Aware Self-Improving Framework for LLM-Based Assembly Code Instruction Following and Comprehension

Pith reviewed 2026-06-30 02:53 UTC · model grok-4.3

classification 💻 cs.SE
keywords obfuscated assembly codeLLM instruction followingbinary analysisself-evolving reinforcement learningVM-based obfuscationcode comprehensionassembly encoder
0
0 comments X

The pith

OASIF improves LLM success rates on obfuscated assembly code by 5.8 to 16.9 percentage points via three-phase self-evolving training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OASIF, a framework that pairs a token-efficient assembly encoder and projector with a three-phase training sequence to let pretrained code LLMs handle long sequences of obfuscated assembly under bounded context. The phases are feature-space alignment, supervised instruction fine-tuning, and online self-evolving reinforcement learning that uses hybrid rewards for continual adaptation. A sympathetic reader would care because commercial VM-based obfuscators make binary analysis hard for current models, and this method claims to raise performance on out-of-distribution test suites while keeping general coding ability intact.

Core claim

OASIF couples a token-efficient assembly encoder with a lightweight projector and follows three-phase training of feature-space alignment, supervised instruction fine-tuning, and online self-evolving reinforcement learning with hybrid rewards. This produces stable gains on VMISA-Bench, where Qwen2.5-Coder-Instruct-14B records success-rate lifts of 15.9, 5.8, and 16.9 percentage points against Code Virtualizer, Themida v3.0.7, and VMProtect v3.5, plus a 9.8-point average lift on OASIF-Bench, while also showing stable gains on seven BCSD benchmarks and no loss on HumanEval, VulBench, and HumanEval-Decompile.

What carries the argument

The online self-evolving reinforcement learning phase with hybrid rewards that drives continual adaptation after the initial alignment and fine-tuning stages.

Load-bearing premise

The online self-evolving reinforcement learning phase produces stable continual adaptation with only minimal manual verification and without reward hacking or distribution shift.

What would settle it

A controlled run in which the self-evolving reinforcement learning phase is disabled or replaced by standard RL and the resulting model shows no gain or a drop on the three commercial obfuscators in VMISA-Bench.

Figures

Figures reproduced from arXiv: 2606.29155 by Chunfu Jia, Ke Wang, Qiyuan Chen, Rongze Chen, Xiang Li, Xinyi Wang, Yanming Liu.

Figure 1
Figure 1. Figure 1: Overview of the OASIF framework. (a) Data Engine generates paired normal/obfuscated assembly by compiling source code and applying obfuscation. (b) OASIF Framework aligns assembly features with a pretrained code LLM and trains in three stages. 2 [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: BCSD experimental results. We present normalized evaluation results. The x-axes denote [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Large Language Models (LLMs) have recently shown promise in automated binary analysis, yet they remain brittle under commercial-grade obfuscation. We present OASIF, an Obfuscation-Aware Self-evolving Instruction-Following framework for obfuscated assembly comprehension. OASIF couples a token-efficient assembly encoder with a lightweight projector to expose long obfuscated code to a pretrained code LLM under a bounded context budget and follows a three-phase training: (i) feature-space alignment, (ii) supervised instruction fine-tuning, and (iii) online self-evolving reinforcement learning with hybrid rewards, enabling continual adaptation with minimal manual verification. On VMISA-Bench, a challenging out-of-distribution suite featuring three commercial VM-based obfuscators, OASIF consistently improves open-source backbones; Qwen2.5-Coder-Instruct-14B attains Success Rate gains of +15.9, +5.8, and +16.9 percentage points (pp) on Code Virtualizer, Themida (v3.0.7), and VMProtect (v3.5), respectively, and improves the OASIF-Bench average by +9.8. OASIF further delivers stable gains across seven standard BCSD benchmarks while preserving general and domain-relevant capabilities on HumanEval, VulBench, and HumanEval-Decompile.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript presents OASIF, an Obfuscation-Aware Self-evolving Instruction-Following framework for LLMs in assembly code comprehension under obfuscation. It features a token-efficient encoder and projector for bounded context, and a three-phase training pipeline: (i) feature-space alignment, (ii) supervised instruction fine-tuning, and (iii) online self-evolving reinforcement learning with hybrid rewards for continual adaptation with minimal manual verification. The paper reports success rate improvements of +15.9, +5.8, and +16.9 percentage points on VMISA-Bench for Code Virtualizer, Themida (v3.0.7), and VMProtect (v3.5) respectively using Qwen2.5-Coder-Instruct-14B, along with +9.8 on OASIF-Bench average, and stable gains on seven BCSD benchmarks while maintaining performance on HumanEval, VulBench, and HumanEval-Decompile.

Significance. If the empirical results hold under rigorous scrutiny, this work could have significant implications for automated binary analysis and reverse engineering by making LLMs more robust to commercial-grade obfuscation. The self-evolving RL component with hybrid rewards offers a novel approach to continual learning in this domain with reduced human oversight. The preservation of general capabilities is noteworthy. However, the lack of detailed experimental information limits the current assessment of its impact.

major comments (3)
  1. [Abstract] Abstract: The hybrid reward formulation for the online self-evolving reinforcement learning phase (phase iii) is not provided. This is load-bearing for the central claim because the abstract explicitly credits this phase for stable continual adaptation under minimal manual verification, yet without the reward definition, online sampling loop, or verification process it is impossible to evaluate the risk of reward hacking or distribution shift that could explain the reported gains rather than the supervised stages.
  2. [Abstract] Abstract: No details are supplied on baselines, statistical significance testing, error bars, data splits, or the precise experimental protocol for the VMISA-Bench results. This undermines assessment of the +15.9 / +5.8 / +16.9 pp gains and the attribution of those gains to the full three-phase OASIF pipeline.
  3. [Abstract] Abstract: No ablations are described that isolate the contribution of phase (iii) from phases (i) and (ii). Without such controls, the headline improvements cannot be confidently linked to the self-evolving RL component that the manuscript presents as the key enabler of continual adaptation.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by a one-sentence outline of the hybrid reward components even at high level.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in the abstract. We address each major comment below and will revise the manuscript accordingly to improve clarity without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The hybrid reward formulation for the online self-evolving reinforcement learning phase (phase iii) is not provided. This is load-bearing for the central claim because the abstract explicitly credits this phase for stable continual adaptation under minimal manual verification, yet without the reward definition, online sampling loop, or verification process it is impossible to evaluate the risk of reward hacking or distribution shift that could explain the reported gains rather than the supervised stages.

    Authors: The hybrid reward formulation, online sampling loop, and verification process are defined in Section 3.3 of the manuscript as a weighted combination of task-success, token-efficiency, and obfuscation-robustness terms, with PPO updates on high-reward trajectories requiring only spot-check verification. To address the concern that this is not immediately accessible from the abstract, we will add a concise description of the hybrid reward to the abstract in the revision. revision: yes

  2. Referee: [Abstract] Abstract: No details are supplied on baselines, statistical significance testing, error bars, data splits, or the precise experimental protocol for the VMISA-Bench results. This undermines assessment of the +15.9 / +5.8 / +16.9 pp gains and the attribution of those gains to the full three-phase OASIF pipeline.

    Authors: The Experimental Setup (Section 5) specifies the baselines, 5-seed protocol with standard-deviation error bars and paired significance tests, VMISA-Bench splits, and evaluation procedure. We will revise the abstract to include a brief statement on the evaluation protocol and statistical testing so that the reported gains can be assessed directly from the abstract. revision: yes

  3. Referee: [Abstract] Abstract: No ablations are described that isolate the contribution of phase (iii) from phases (i) and (ii). Without such controls, the headline improvements cannot be confidently linked to the self-evolving RL component that the manuscript presents as the key enabler of continual adaptation.

    Authors: We agree that an explicit ablation isolating phase (iii) would strengthen attribution of the gains. The current results focus on the complete pipeline; we will add an ablation study in the revised manuscript that compares the full OASIF pipeline against the model after only phases (i) and (ii) on VMISA-Bench. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark claims without derivations or self-referential fitting.

full rationale

The paper describes an empirical three-phase training pipeline (feature alignment, supervised fine-tuning, online self-evolving RL) and reports measured success-rate gains on VMISA-Bench and BCSD benchmarks. No equations, uniqueness theorems, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All central claims reduce to externally observable benchmark deltas rather than any derivation that collapses to its own inputs by construction. This is the normal, non-circular case for an applied ML systems paper whose validity rests on reproducible experimental results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5796 in / 975 out tokens · 46365 ms · 2026-06-30T02:53:44.411046+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

27 extracted references · 17 canonical work pages · 10 internal anchors

  1. [1]

    Themida overview.https://www.oreans.com/themida.php,

    Accessed: 2019-04-12. Themida overview.https://www.oreans.com/themida.php,

  2. [2]

    Vmprotect software protection.https://www.vmpsoft.com/,

    Accessed: 2019-04-12. Vmprotect software protection.https://www.vmpsoft.com/,

  3. [3]

    Aur (en) - home.https://aur.archlinux.org/,

    Accessed: 2019-04-12. Aur (en) - home.https://aur.archlinux.org/,

  4. [4]

    Arch linux - package search

    Accessed: 2025-05-06. Arch linux - package search. https://archlinux.org/packages/,

  5. [5]

    Microsoft security copilot — microsoft security

    Accessed: 2025-05-06. Microsoft security copilot — microsoft security. https://www.microsoft.com/en-us/ security/business/ai-machine-learning/microsoft-security-copilot ,

  6. [6]

    GPT-4 Technical Report

    Ac- cessed: 2025-05-06. Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  7. [7]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609,

  8. [8]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

  9. [9]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    11 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186,

  10. [10]

    Asm2vec: Boosting static representa- tion robustness for binary clone search against code obfuscation and compiler optimization

    Steven HH Ding, Benjamin CM Fung, and Philippe Charland. Asm2vec: Boosting static representa- tion robustness for binary clone search against code obfuscation and compiler optimization. In 2019 ieee symposium on security and privacy (sp), pages 472–489. IEEE,

  11. [11]

    How far have we gone in vulnerability detection using large language models,

    Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, and Chao Zhang. How far have we gone in vulnerability detection using large language models.arXiv preprint arXiv:2311.12420,

  12. [12]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196,

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

  14. [14]

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186,

  15. [15]

    Nova: Generative language models for binaries.arXiv preprint arXiv:2311.13721,

    Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, and Xiangyu Zhang. Nova: Generative language models for binaries.arXiv preprint arXiv:2311.13721,

  16. [16]

    Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568,

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct.arXiv preprint arXiv:2306.08568,

  17. [17]

    The convergence of source code and binary vulnerability discovery–a case study

    Alessandro Mantovani, Luca Compagna, Yan Shoshitaishvili, and Davide Balzarotti. The convergence of source code and binary vulnerability discovery–a case study. InProceedings of the 2022 ACM on Asia Conference on Computer and Communications Security, pages 602–615,

  18. [18]

    Disassembling obfuscated executables with llm.arXiv preprint arXiv:2407.08924,

    Huanyao Rong, Yue Duan, Hang Zhang, XiaoFeng Wang, Hongbo Chen, Shengchen Duan, and Shen Wang. Disassembling obfuscated executables with llm.arXiv preprint arXiv:2407.08924,

  19. [19]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300,

  20. [20]

    Llm4decompile: Decompiling binary code with large language models.arXiv preprint arXiv:2403.05286,

    Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. Llm4decompile: Decompiling binary code with large language models.arXiv preprint arXiv:2403.05286,

  21. [21]

    Deconstructing obfuscation: A four- dimensional framework for evaluating large language models assembly code deobfuscation capa- bilities.arXiv preprint arXiv:2505.19887,

    Anton Tkachenko, Dmitrij Suskevic, and Benjamin Adolphi. Deconstructing obfuscation: A four- dimensional framework for evaluating large language models assembly code deobfuscation capa- bilities.arXiv preprint arXiv:2505.19887,

  22. [22]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

  23. [23]

    Asma-tune: Unlocking llms’ assembly code comprehension via structural-semantic instruction tuning.arXiv preprint arXiv:2503.11617,

    Xinyi Wang, Jiashui Wang, Jinbo Su, Ke Wang, Peng Chen, Yanming Liu, Long Liu, Xiang Li, Yangdong Wang, Qiyuan Chen, et al. Asma-tune: Unlocking llms’ assembly code comprehension via structural-semantic instruction tuning.arXiv preprint arXiv:2503.11617,

  24. [24]

    Vmhunt: A verifiable approach to partially- virtualized binary code simplification

    Dongpeng Xu, Jiang Ming, Yu Fu, and Dinghao Wu. Vmhunt: A verifiable approach to partially- virtualized binary code simplification. InProceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 442–458,

  25. [25]

    DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence

    Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y Wu, Yukun Li, Huazuo Gao, Shirong Ma, et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence.arXiv preprint arXiv:2406.11931,

  26. [26]

    TTRL: Test-Time Reinforcement Learning

    Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, et al. Ttrl: Test-time reinforcement learning.arXiv preprint arXiv:2504.16084,

  27. [27]

    Beyond the aggregate sizes reported in Sec

    None (no obfuscation) 79,920 SUB (Instruction Substitution)b 2,651 FLA (Control-Flow Flattening)c 58,578 BCF (Bogus Control Flow)d 79,920 ALL (SUB+FLA+BCF)e 79,920 Training Data Dalign (simp) 430,027 Dsft (detail/conv/reason) 11,750 Drl (reason+) 4,244 a Includesub/fla/bcf/all b SUB-mllvm-sub c FLA-mllvm-fla-mllvm-perFLA=100 d BCF-mllvm-bcf-mllvm-boguscf-...