pith. machine review for the scientific record.

arxiv: 2604.06095 · v1 · submitted 2026-04-07 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

LLM4CodeRE: Generative AI for Code Decompilation Analysis and Reverse Engineering

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:17 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords decompilation · reverse engineering · large language models · malware analysis · code translation · fine-tuning · assembly code · bidirectional generation

The pith

A domain-adapted LLM enables accurate bidirectional translation between assembly and source code for malware reverse engineering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLM4CodeRE, a large language model framework adapted specifically for code decompilation and reverse engineering of malicious software. It develops two fine-tuning strategies to let a single model perform both assembly-to-source decompilation and source-to-assembly translation. This matters because standard decompilation tools often fail against the obfuscation used in modern malware, and improved AI performance could speed up security analysis. The work shows that targeted adaptation outperforms both traditional decompilers and generic code models on these tasks.

Core claim

LLM4CodeRE is a domain-adaptive LLM framework for bidirectional code reverse engineering that supports assembly-to-source decompilation and source-to-assembly translation within a unified model. It uses a Multi-Adapter approach for task-specific syntactic and semantic alignment together with a Seq2Seq Unified approach that applies task-conditioned prefixes to enforce end-to-end generation constraints. Experimental results demonstrate that this model outperforms existing decompilation tools and general-purpose code models while achieving robust bidirectional generalization.
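As a hedged illustration of the Multi-Adapter idea named in the claim (not the paper's code): a frozen backbone weight matrix is shared across tasks, and a small LoRA-style low-rank update is selected per task. The task names, rank, and shapes below are assumptions of this sketch.

```python
import numpy as np

# Toy sketch of per-task low-rank adapters over a frozen backbone:
# effective weights W_eff = W + B @ A, with one (B, A) pair per task.
rng = np.random.default_rng(0)
d, r = 8, 2                      # hidden size and adapter rank (toy values)
W = rng.normal(size=(d, d))      # frozen backbone weights, shared by all tasks

adapters = {                     # one low-rank pair per translation direction
    "asm2src": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "src2asm": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def effective_weights(task: str) -> np.ndarray:
    """Route the forward pass through the adapter for the given task."""
    B, A = adapters[task]
    return W + B @ A             # only B and A would be trained

x = rng.normal(size=d)
y = effective_weights("asm2src") @ x   # task-routed forward pass
```

The point of the sketch is only the routing: the two directions share one backbone but diverge in a small number of trainable parameters.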

What carries the argument

The LLM4CodeRE model, which combines Multi-Adapter fine-tuning for alignment with Seq2Seq Unified training using task-conditioned prefixes to control generation across both translation directions.
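Task-conditioned prefixes can be as simple as prepending a direction token to each training example so that one model learns both mappings. A minimal sketch, assuming invented prefix strings (`<asm2src>`, `<src2asm>`) that are illustrative rather than taken from the paper:

```python
# Hypothetical sketch of task-conditioned prefixing for bidirectional
# training, in the spirit of the Seq2Seq Unified strategy.

def make_example(direction: str, source: str, target: str) -> dict:
    """Build one supervised training pair with a task-conditioning prefix."""
    prefixes = {"asm2src": "<asm2src>", "src2asm": "<src2asm>"}
    return {
        "input": f"{prefixes[direction]} {source}",
        "target": target,
    }

asm = "mov eax, 1\nret"
src = "int f(void) { return 1; }"

# The same aligned pair yields two examples, one per direction.
fwd = make_example("asm2src", asm, src)
bwd = make_example("src2asm", src, asm)
```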

If this is right

  • A single model can reliably perform both decompilation and assembly generation for reverse engineering tasks.
  • Domain-specific adaptation yields higher accuracy than generic large language models on malware code.
  • The approach reduces dependence on separate specialized tools for each translation direction.
  • Improved handling of obfuscated code supports faster identification of malicious functionality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same adaptation pattern could be tested on other low-level code tasks such as binary analysis or embedded system code.
  • Integration into existing analyst platforms might allow real-time suggestions during manual review sessions.
  • Collecting larger and more varied sets of obfuscated samples could further strengthen generalization.
  • The unified bidirectional capability opens the possibility of iterative refinement where generated source code is re-assembled and checked for consistency.
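The round-trip idea in the last bullet could be prototyped as a small consistency check: translate assembly to source, regenerate assembly from that source, and score the agreement. The `translate` callable, direction labels, and similarity metric below are placeholders for illustration, not components described in the paper.

```python
import difflib

def round_trip_score(translate, asm: str) -> float:
    """Decompile asm to source, regenerate asm, and score the agreement.

    `translate(direction, text)` stands in for any asm<->src model.
    Returns a ratio in [0, 1]; 1.0 means the round trip is lossless.
    """
    src = translate("asm2src", asm)
    asm_back = translate("src2asm", src)
    return difflib.SequenceMatcher(None, asm, asm_back).ratio()

# Identity stub: a perfect round trip scores exactly 1.0.
def perfect(direction, text):
    return text  # placeholder for a real model call

assert round_trip_score(perfect, "mov eax, 1\nret") == 1.0
```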

Load-bearing premise

The domain-adaptive fine-tuning strategies will generalize effectively to real-world malicious software without significant overfitting or loss of performance on diverse obfuscation techniques.

What would settle it

Performance measurements on a new collection of previously unseen obfuscated malware binaries: if the model's decompilation accuracy falls below that of standard tools such as Ghidra or IDA Pro, the central claim fails.
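One minimal way to operationalize such a head-to-head test, assuming per-sample accuracy scores in [0, 1] are already computed for the model and a baseline tool on the same held-out set (the numbers below are invented, not data from the paper):

```python
# Hypothetical harness for the refutation test: on unseen obfuscated
# binaries, does the model's mean decompilation accuracy stay at or
# above a baseline such as Ghidra?

def mean(xs):
    return sum(xs) / len(xs)

def model_beats_baseline(model_scores, baseline_scores) -> bool:
    """True if the model's mean per-sample accuracy is at least the
    baseline's on the same held-out set."""
    assert len(model_scores) == len(baseline_scores)
    return mean(model_scores) >= mean(baseline_scores)

# Invented example values for illustration only:
assert model_beats_baseline([0.8, 0.7, 0.9], [0.6, 0.7, 0.8])
assert not model_beats_baseline([0.4, 0.5], [0.6, 0.7])
```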

Figures

Figures reproduced from arXiv: 2604.06095 by Ali A. Ghorbani, Hamed Jelodar, Mohammad Meymani, Parisa Hamedi, Roozbeh Razavi-Far, Samita Bai, Tochukwu Emmanuel Nwankwo.

Figure 1: System pipeline of the proposed LLM4CodeRE framework. Malware binaries are disassembled and normalized into paired assembly and source representations. A backbone LLM is pretrained using a causal language modeling (CLM) objective on malware corpora. Task adaptation is performed using LoRA and task-specific adapters or Seq2Seq Unified prefix tokens. The unified model supports bidirectional code transformati…

Figure 2: Relationship between fine-tuning strategies and trained models for decompiling tasks.

Figure 3: Pareto chart of the dedicated compiled count on the PE-Machine Learning-200 dataset. Bars (left axis) show per-model counts (sorted in descending…

Figure 4: Perplexity comparison across datasets for four language models under two settings: Original (non-fine-tuned) and Domain (domain-fine-tuned). Left:…

Figure 5: Quantitative evaluation of conversion quality in both directions. (a) Asm-to-Src and (b) Src-to-Asm: Edit similarity and semantic similarity across…

Figure 6: Re-executability scores (percentage of translated programs that recompile and execute successfully) for assembly-to-source (Asm…
Original abstract

Code decompilation analysis is a fundamental yet challenging task in malware reverse engineering, particularly due to the pervasive use of sophisticated obfuscation techniques. Although recent large language models (LLMs) have shown promise in translating low-level representations into high-level source code, most existing approaches rely on generic code pretraining and lack adaptation to malicious software. We propose LLM4CodeRE, a domain-adaptive LLM framework for bidirectional code reverse engineering that supports both assembly-to-source decompilation and source-to-assembly translation within a unified model. To enable effective task adaptation, we introduce two complementary fine-tuning strategies: (i) a Multi-Adapter approach for task-specific syntactic and semantic alignment, and (ii) a Seq2Seq Unified approach using task-conditioned prefixes to enforce end-to-end generation constraints. Experimental results demonstrate that LLM4CodeRE outperforms existing decompilation tools and general-purpose code models, achieving robust bidirectional generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes LLM4CodeRE, a domain-adaptive LLM framework for bidirectional code reverse engineering in malware analysis. It supports assembly-to-source decompilation and source-to-assembly translation in a single model and introduces two fine-tuning strategies: (i) Multi-Adapter for task-specific syntactic and semantic alignment and (ii) Seq2Seq Unified with task-conditioned prefixes. The central claim, based on experimental results, is that LLM4CodeRE outperforms existing decompilation tools and general-purpose code models while achieving robust bidirectional generalization.

Significance. If the experimental claims hold with rigorous validation, the work could meaningfully advance malware reverse engineering by providing a unified generative model adapted to malicious code, potentially improving analysis of obfuscated binaries over generic LLMs or traditional decompilers. The bidirectional capability and domain-adaptive strategies are conceptually promising contributions to the intersection of LLMs and security.

major comments (3)
  1. Abstract: The claim that 'Experimental results demonstrate that LLM4CodeRE outperforms existing decompilation tools and general-purpose code models, achieving robust bidirectional generalization' is presented without any details on datasets, metrics, baselines, or controls. This information is load-bearing for the central claim and must be supplied to allow verification.
  2. Experimental evaluation: No evidence is provided that the training or test distributions include realistic malware obfuscations such as control-flow flattening, packing, or virtualization. Without such coverage and corresponding performance breakdowns, the claim of robust generalization to real-world malicious software cannot be supported.
  3. §3 (fine-tuning strategies): The Multi-Adapter and Seq2Seq Unified approaches are described at a high level in terms of syntactic/semantic alignment and task prefixes, but no analysis is given of how these mechanisms prevent overfitting to non-obfuscated corpora or ensure adaptation to the distribution of actual malicious binaries.
minor comments (1)
  1. The abstract would be clearer if it stated the base LLM, model size, and number of training examples used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

Point-by-point responses
  1. Referee: Abstract: The claim that 'Experimental results demonstrate that LLM4CodeRE outperforms existing decompilation tools and general-purpose code models, achieving robust bidirectional generalization' is presented without any details on datasets, metrics, baselines, or controls. This information is load-bearing for the central claim and must be supplied to allow verification.

    Authors: We agree that the abstract would benefit from additional context. In the revised version, we will expand the abstract with a concise summary of the key datasets (assembly-to-source pairs drawn from open-source and malware corpora), primary metrics (BLEU-4, exact match, and CodeBLEU), and main baselines (Ghidra, RetDec, and general code models such as CodeT5). Full experimental details will remain in Section 4, but this addition will allow readers to better assess the central claim within the abstract's length constraints. revision: yes

  2. Referee: Experimental evaluation: No evidence is provided that the training or test distributions include realistic malware obfuscations such as control-flow flattening, packing, or virtualization. Without such coverage and corresponding performance breakdowns, the claim of robust generalization to real-world malicious software cannot be supported.

    Authors: This is a fair observation and highlights a genuine limitation of the current evaluation. While our training and test sets incorporate samples from malware repositories that exhibit some obfuscation, we do not provide explicit coverage or breakdowns for advanced techniques such as control-flow flattening, packing, or virtualization. In the revision we will add a new subsection discussing this scope limitation, qualify the generalization claims to reflect the evaluated distributions, and report performance on the subset of available obfuscated samples. We view this as an important direction for future work. revision: partial

  3. Referee: §3 (fine-tuning strategies): The Multi-Adapter and Seq2Seq Unified approaches are described at a high level in terms of syntactic/semantic alignment and task prefixes, but no analysis is given of how these mechanisms prevent overfitting to non-obfuscated corpora or ensure adaptation to the distribution of actual malicious binaries.

    Authors: We thank the referee for this suggestion. In the revised manuscript we will expand Section 3 with a more detailed analysis of both strategies. This will include ablation studies quantifying the contribution of the adapters to syntactic and semantic alignment, discussion of regularization effects that mitigate overfitting to non-obfuscated data, and experiments demonstrating how task-conditioned prefixes improve adaptation to malicious binary distributions. Supporting results on held-out malware samples will be added to substantiate these claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on reported experiments

Full rationale

The manuscript introduces LLM4CodeRE as a domain-adaptive LLM framework with two fine-tuning strategies (Multi-Adapter and Seq2Seq Unified) and asserts outperformance plus bidirectional generalization solely via experimental results. No equations, derivations, or parameter-fitting steps appear in the provided text. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim therefore does not reduce by construction to its own inputs; it is an empirical assertion whose validity hinges on the (unexamined here) experimental design rather than definitional or self-referential closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail on any free parameters, axioms, or invented entities; no mathematical derivations or specific model components are described.

pith-pipeline@v0.9.0 · 5484 in / 938 out tokens · 49607 ms · 2026-05-10T19:17:29.838543+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution

    cs.CR · 2026-05 · unverdicted · novelty 5.0

    LCC-LLM creates a code-centric dataset and RAG-based LLM framework that reaches 0.634 average semantic similarity on 43 malware tasks and 10/10 pass rate in real-world case studies.

Reference graph

Works this paper leans on

21 extracted references · 10 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Malware detection using control flow graphs,

    P. K. Tiwari, “Malware detection using control flow graphs,” in 2024 2nd International Conference on Device Intelligence, Computing and Communication Technologies (DICCT), 2024, pp. 216–220.

  2. [2]

    A static method for detecting android malware based on directed api call,

    M. Vu Minh, H.-T. Nguyen, H. V. Le, T. D. Nguyen, and X. C. Do, “A static method for detecting android malware based on directed api call,” International Journal of Web Information Systems, vol. 21, no. 3, pp. 183–204, 2025.

  3. [3]

    CodeBERT: A pre-trained model for programming and natural languages,

    Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu, Eds. Online: Association for Computational Linguistics, Nov. 2020, pp. 1536–1547. ...

  4. [4]

    Qwen2.5-Coder Technical Report

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y. Fan, Y. Zhang, A. Yang, R. Men, F. Huang, B. Zheng, Y. Miao, S. Quan, Y. Feng, X. Ren, X. Ren, J. Zhou, and J. Lin, “Qwen2.5-Coder technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2409.12186

  5. [5]

    Asma-tune: Unlocking llms’ assembly code comprehension via structural-semantic instruction tuning,

    X. Wang, J. Wang, J. Su, K. Wang, P. Chen, Y. Liu, L. Liu, X. Li, Y. Wang, Q. Chen, R. Chen, and C. Jia, “Asma-tune: Unlocking LLMs’ assembly code comprehension via structural-semantic instruction tuning,” 2025. [Online]. Available: https://arxiv.org/abs/2503.11617

  6. [6]

    SoK: Potentials and challenges of large language models for reverse engineering,

    X. Hu, Z. Fu, S. Xie, S. H. H. Ding, and P. Charland, “SoK: Potentials and challenges of large language models for reverse engineering,” 2025. [Online]. Available: https://arxiv.org/abs/2509.21821

  7. [7]

    Large language models (LLMs) for source code analysis: applications, models and datasets,

    H. Jelodar, M. Meymani, and R. Razavi-Far, “Large language models (LLMs) for source code analysis: applications, models and datasets,” arXiv preprint arXiv:2503.17502, 2025. [Online]. Available: https://arxiv.org/abs/2503.17502

  8. [9]

    WaDec: Decompiling WebAssembly using large language models,

    X. She, Y. Zhao, and H. Wang, “WaDec: Decompiling WebAssembly using large language models,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024.

  9. [10]

    Parameter-Efficient Transfer Learning for NLP

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for NLP,” 2019. [Online]. Available: https://arxiv.org/abs/1902.00751

  10. [11]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” 2021. [Online]. Available: https://arxiv.org/abs/2101.00190

  11. [12]

    LoRA: Low-Rank Adaptation of Large Language Models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” 2021. [Online]. Available: https://arxiv.org/abs/2106.09685

  12. [13]

    PE Malware Machine Learning Dataset,

    pracsec, “PE Malware Machine Learning Dataset,” Jun. 2021. [Online]. Available: https://practicalsecurityanalytics.com/pe-malware-machine-learning-dataset/

  13. [14]

    Sban: A framework & multi-dimensional dataset for large language model pre-training and software code mining,

    H. Jelodar, M. Meymani, S. Bai, R. Razavi-Far, and A. A. Ghorbani, “Sban: A framework & multi-dimensional dataset for large language model pre-training and software code mining,”

  14. [15]
  15. [16]

    LLM4Decompile: Decompiling binary code with large language models,

    H. Tan, Q. Luo, J. Li, and Y. Zhang, “LLM4Decompile: Decompiling binary code with large language models,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2024, pp. 3473–3487. [Online]. Available: http://dx.doi.org/10.18653/v1/2024.emnlp-main.203

  16. [17]

    Compiling files in parallel: A study with gcc,

    G. Belinassi, R. Biener, J. Hubička, D. Cordeiro, and A. Goldman, “Compiling files in parallel: A study with gcc,” in 2022 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW). IEEE, 2022, pp. 1–8.

  17. [18]

    A neural network based gcc cost model for faster compiler tuning,

    H. Shahzad, A. Sanaullah, S. Arora, U. Drepper, and M. Herbordt, “A neural network based gcc cost model for faster compiler tuning,” in 2024 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2024, pp. 1–9.

  18. [19]

    Sandboxing and virtualization: Modern tools for combating malware,

    C. Greamo and A. Ghosh, “Sandboxing and virtualization: Modern tools for combating malware,” IEEE Security & Privacy, vol. 9, no. 2, pp. 79–82, 2011.

  19. [20]

    Deep learning from imperfectly labeled malware data,

    F. Alotaibi, E. Goodbrand, and S. Maffeis, “Deep learning from imperfectly labeled malware data,” in Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025, pp. 3990–4004.

  20. [21]

    Foredroid: Scenario-aware analysis for android malware detection and explanation,

    J. Li, S. Chen, C. Wu, Y. Zhang, and L. Fan, “Foredroid: Scenario-aware analysis for android malware detection and explanation,” in Proceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security, 2025, pp. 1379–1393.

  21. [22]

    LM-Scout: Analyzing the security of language model integration in android apps,

    M. Ibrahim, G. S. Tuncay, Z. B. Celik, A. Machiry, and A. Bianchi, “LM-Scout: Analyzing the security of language model integration in android apps,” arXiv preprint arXiv:2505.08204, 2025.