pith. sign in

arxiv: 2606.20436 · v1 · pith:N2LTRQNBnew · submitted 2026-06-18 · 💻 cs.CR · cs.AI

Multi-View Decompilation for LLM-Based Malware Classification

Pith reviewed 2026-06-26 17:08 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords malware classificationdecompilationlarge language modelsGhidraRetDecbinary analysismulti-view promptingpseudo-C
0
0 comments X

The pith

Feeding LLMs decompiled views from both Ghidra and RetDec raises malicious-class F1 over single-view baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether an LLM receives more useful signals for malware classification when it sees pseudo-C from two different decompilers instead of one. Because each decompiler applies its own heuristics and loses different details from the original binary, the two outputs can flag different indicators of malicious behavior. Experiments on a curated set of benign and malicious samples show consistent gains in F1 for the malicious class, driven mainly by higher recall. The result holds across multiple model families and requires no training or fine-tuning. A sympathetic reader would care because malware analysts routinely work from decompiled code, and a change limited to the prompt could reduce missed detections in existing pipelines.

Core claim

The paper establishes that supplying matched pseudo-C from both Ghidra and RetDec to an LLM classifier produces higher malicious-class F1 scores than either decompiler alone, chiefly by lifting recall on malicious samples, and that the two decompilers commit partially non-overlapping errors on the same binaries.

What carries the argument

Paired pseudo-C outputs from Ghidra and RetDec, supplied together in the LLM prompt as complementary evidence for the benign/malicious decision.

If this is right

  • Malicious recall increases when both decompiler views are provided to the LLM.
  • Ghidra and RetDec generate classification errors that only partially overlap on the benchmark samples.
  • The F1 improvement appears across a range of LLMs without any model changes or additional training.
  • Multi-decompiler prompting supplies a training-free route to better malware triage in settings that already use decompiled code.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairing of decompiler outputs could be tested on related binary-analysis tasks such as vulnerability discovery or packer identification.
  • Adding a third decompiler might produce further recall gains or show diminishing returns once the main sources of disagreement are covered.
  • If decompiler diversity helps LLMs overcome individual tool limitations, comparable multi-tool prompting may help on other lossy reverse-engineering problems outside malware.

Load-bearing premise

The curated collection of benign utilities and malicious programs is representative enough that the observed F1 gain will appear on other binaries and threat behaviors.

What would settle it

Applying the identical single-view versus dual-view comparison to a new collection of binaries drawn from different malware families and observing no improvement or a decline in malicious-class F1.

Figures

Figures reproduced from arXiv: 2606.20436 by Bercan Turkmen, Vyas Raina.

Figure 1
Figure 1. Figure 1: Three workflows for binary malware classification. Conventional reverse engineering relies on a human [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prediction agreement between Ghidra-only and RetDec-only classifiers. Rows correspond to the Ghidra [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Per-family recall — Claude Haiku 4.5. mirai (n=13) mydoom (n=13) hellbot (n=8) dexter (n=5) carberp (n=3) x0r (n=2) remhead (n=2) minipig (n=1) rubilyn (n=1) rovnix (n=1) synthetic_keylogger (n=1) 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Recall 54% 46% 50% 100% 33% 100% 50% 100% 100% 100% 100% 69% 69% 60% 33% 100% 50% 100% 100% 100% 100% 69% 85% 62% 100% 67% 100% 100% 100% 100% 100% 100% Per-family malware recall Gemin… view at source ↗
Figure 4
Figure 4. Figure 4: Per-family recall — Gemini 2.5 Flash-Lite. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Per-family recall — Llama-3.3-70B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Per-family recall — Qwen3.6-35B-A3B. mirai (n=13) mydoom (n=13) hellbot (n=8) dexter (n=5) carberp (n=3) x0r (n=2) remhead (n=2) minipig (n=1) rubilyn (n=1) rovnix (n=1) synthetic_keylogger (n=1) 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Recall 54% 38% 38% 100% 67% 50% 100% 100% 100% 100% 54% 23% 25% 60% 33% 100% 50% 100% 100% 100% 54% 54% 62% 100% 67% 100% 100% 100% 100% 100% Per-family malware recall GPT-5.4 Mini Ghid… view at source ↗
Figure 7
Figure 7. Figure 7: Per-family recall — GPT-5.4-mini [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prediction agreement between Ghidra-only and RetDec-only classifiers. Rows correspond to the Ghidra [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Definition-only classification prompt used in the main experiments. The decompiled code is supplied in [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Indicator-guided prompt used only for the upper-bound analysis in Appendix [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
read the original abstract

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that providing matched pseudo-C views from both Ghidra and RetDec decompilers to LLMs improves malicious-class F1 (primarily via higher recall) over single-decompiler baselines on a curated benchmark of benign utilities and malicious programs; it further claims that agreement analyses show the decompilers make partially different errors, supporting the use of multi-view prompting as a training-free improvement for LLM-based malware triage.

Significance. If the empirical results hold under proper controls and on representative data, the work would demonstrate a low-cost way to exploit decompiler complementarity for better LLM malware classification, which could be directly useful in practical reverse-engineering settings.

major comments (2)
  1. [Abstract] Abstract: the central claim of improved malicious F1 and complementary errors rests on an empirical measurement, yet the abstract (and by extension the reported evaluation) supplies no dataset size, sample sources, compilation details, model names/sizes, statistical significance tests, or controls for prompt length/ordering; without these the observed directional improvement cannot be verified or reproduced.
  2. [Abstract] Abstract: the benchmark is described only as spanning 'a range of threat behaviors' with no counts, families, binary sizes, or diversity metrics reported; this directly undermines the load-bearing assumption that the F1 gain (driven by recall) will generalize beyond the specific curated set.
minor comments (1)
  1. The manuscript would benefit from an explicit experimental-setup subsection detailing the exact prompting templates, LLM inference parameters, and how agreement between decompiler views was quantified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We agree that greater specificity in the abstract will improve clarity, reproducibility, and support for the claims. We address each comment below and will revise the abstract in the next version.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of improved malicious F1 and complementary errors rests on an empirical measurement, yet the abstract (and by extension the reported evaluation) supplies no dataset size, sample sources, compilation details, model names/sizes, statistical significance tests, or controls for prompt length/ordering; without these the observed directional improvement cannot be verified or reproduced.

    Authors: The full manuscript reports dataset size, sample sources, compilation settings, model names/sizes, and evaluation procedures in Sections 3 and 4. We acknowledge that the abstract itself is insufficiently self-contained. We will revise the abstract to include these elements (dataset cardinality, model families, and note on significance testing). We will also add an explicit ablation on prompt length and ordering effects in the experiments section of the revised manuscript to confirm robustness. revision: yes

  2. Referee: [Abstract] Abstract: the benchmark is described only as spanning 'a range of threat behaviors' with no counts, families, binary sizes, or diversity metrics reported; this directly undermines the load-bearing assumption that the F1 gain (driven by recall) will generalize beyond the specific curated set.

    Authors: Section 3 of the manuscript already supplies the requested counts, families, size distributions, and diversity metrics for the curated benchmark. The abstract uses a concise phrasing for brevity, but we agree this limits immediate assessment of scope. We will revise the abstract to report these key statistics so that the generalization argument is better grounded. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement on held-out benchmark

full rationale

The paper describes an empirical pipeline: curation of a benchmark of binaries, decompilation via Ghidra and RetDec to produce matched pseudo-C views, and direct measurement of LLM classification performance (F1, recall) when given single vs. dual views. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The central result is a straightforward comparison of observed metrics on the curated set; nothing reduces to its own inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical benchmarking study; it rests on the domain assumption that decompilers produce meaningfully different artefacts and on the representativeness of the chosen benchmark.

axioms (1)
  • domain assumption Decompilers are lossy heuristic tools that can expose different artefacts of the same binary.
    Explicitly stated as the motivation for using multiple views.

pith-pipeline@v0.9.1-grok · 5709 in / 1196 out tokens · 27239 ms · 2026-06-26T17:08:18.214884+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    2026 , eprint =

    A Decompilation-Driven Framework for Malware Detection with Large Language Models , author =. 2026 , eprint =

  2. [2]

    2024 , eprint =

    Enhancing Reverse Engineering: Investigating and Benchmarking Large Language Models for Vulnerability Analysis in Decompiled Binaries , author =. 2024 , eprint =

  3. [3]

    Large Language Models for Code Analysis: Do

    Fang, Chongzhou and Miao, Ning and Srivastav, Shaurya and Liu, Jialin and Zhang, Ruoyu and Fang, Ruijie and Asmita and Tsang, Ryan and Nazari, Najmeh and Wang, Han and Homayoun, Houman , booktitle =. Large Language Models for Code Analysis: Do. 2024 , isbn =

  4. [4]

    ISBN 9798400706127

    Evaluating the Effectiveness of Decompilers , author =. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2024) , pages =. 2024 , address =. doi:10.1145/3650212.3652144 , url =

  5. [5]

    2024 , eprint =

    Exploring the Efficacy of Large Language Models (GPT-4) in Binary Reverse Engineering , author =. 2024 , eprint =

  6. [6]

    2014 , howpublished =

    ytisf and contributors , title =. 2014 , howpublished =

  7. [7]

    Claude Sonnet 4.6 , year =

  8. [8]

    Ghidra Software Reverse Engineering Framework , year =

  9. [9]

    2024 , howpublished =

  10. [10]

    2024 , note =

    Gemini: A Family of Highly Capable Multimodal Models , institution =. 2024 , note =

  11. [11]

    2024 , howpublished =

    Vertex. 2024 , howpublished =

  12. [12]

    Docker: Containerization Platform , year =

  13. [13]

    Cifuentes, Cristina , title =

  14. [14]

    Proceedings of the Network and Distributed System Security Symposium (NDSS) , year =

    Lee, JongHyup and Avgerinos, Thanassis and Brumley, David , title =. Proceedings of the Network and Distributed System Security Symposium (NDSS) , year =

  15. [15]

    and Lee, JongHyup and Woo, Maverick and Brumley, David , title =

    Schwartz, Edward J. and Lee, JongHyup and Woo, Maverick and Brumley, David , title =. 22nd USENIX Security Symposium (USENIX Security 13) , pages =. 2013 , publisher =

  16. [16]

    and Vasilescu, Bogdan and Le Goues, Claire , title =

    Dramko, Luke and Lacomis, Jeremy and Schwartz, Edward J. and Vasilescu, Bogdan and Le Goues, Claire , title =. 33rd USENIX Security Symposium (USENIX Security 24) , year =

  17. [17]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , year =. 2507.06261 , archivePrefix =

  18. [18]

    2026 , howpublished =

    Introducing. 2026 , howpublished =

  19. [19]

    Introducing Claude Haiku 4.5 , year =

  20. [20]

    2025 , eprint =

    Qwen3 Technical Report , author =. 2025 , eprint =

  21. [21]

    International Conference on Learning Representations (ICLR) , year =

    Self-Consistency Improves Chain of Thought Reasoning in Language Models , author =. International Conference on Learning Representations (ICLR) , year =

  22. [22]

    2023 , eprint =

    Improving Factuality and Reasoning in Language Models through Multiagent Debate , author =. 2023 , eprint =

  23. [23]

    2024 , eprint =

    Mixture-of-Agents Enhances Large Language Model Capabilities , author =. 2024 , eprint =

  24. [24]

    Transactions on Machine Learning Research (TMLR) , year =

    More Agents Is All You Need , author =. Transactions on Machine Learning Research (TMLR) , year =

  25. [25]

    Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

    Chen, Zhijun and Li, Jingzheng and Chen, Pengpeng and Li, Zhuoran and Sun, Kai and Luo, Yuankai and Mao, Qianren and Yang, Dingqi and Sun, Hailong and Yu, Philip S. , year =. Harnessing Multiple Large Language Models: A Survey on. 2502.18036 , archivePrefix =

  26. [26]

    2022 , eprint =

    Malicious Source Code Detection Using Transformer , author =. 2022 , eprint =

  27. [27]

    2025 , url =

    Zhao, Wenxiang and Wu, Juntao and Meng, Zhaoyi , journal =. 2025 , url =

  28. [28]

    2502.13055 , archivePrefix =

    Qian, Xingzhi and Zheng, Xinran and He, Yiling and Yang, Shuo and Cavallaro, Lorenzo , year =. 2502.13055 , archivePrefix =

  29. [29]

    Feasibility Study for Supporting Static Malware Analysis Using

    Fujii, Shota and Yamagishi, Rei , booktitle =. Feasibility Study for Supporting Static Malware Analysis Using. 2025 , url =

  30. [30]

    Assessing

    Patsakis, Constantinos and Casino, Fran and Lykousas, Nikolaos , journal =. Assessing. 2024 , doi =

  31. [31]

    2024 , eprint =

    The. 2024 , eprint =

  32. [32]

    Proceedings of the 38th Annual Computer Security Applications Conference (ACSAC) , year =

    Boosting Neural Networks to Decompile Optimized Binaries , author =. Proceedings of the 38th Annual Computer Security Applications Conference (ACSAC) , year =

  33. [33]

    2024 , eprint =

    Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study , author =. 2024 , eprint =

  34. [34]

    2024 , eprint =

    Investigating Large Language Models for Code Vulnerability Detection: An Experimental Study , author =. 2024 , eprint =

  35. [35]

    Proceedings of the 45th International Conference on Software Engineering (ICSE) , pages =

    Automated Program Repair in the Era of Large Pre-trained Language Models , author =. Proceedings of the 45th International Conference on Software Engineering (ICSE) , pages =. 2023 , doi =

  36. [36]

    and Bu, H

    Zhang, J. and Bu, H. and Wen, H. and Chen, Y. and Li, L. and Zhu, H. , journal =. When. 2025 , doi =

  37. [37]

    2024 , eprint =

    Large Language Models for Cyber Security: A Systematic Literature Review , author =. 2024 , eprint =

  38. [38]

    2025 , eprint =

    Exploring Large Language Models for Semantic Analysis and Categorization of Android Malware , author =. 2025 , eprint =

  39. [39]

    2406.18379 , archivePrefix =

    Lu, Haolang and Peng, Hongrui and Nan, Guoshun and Cui, Jiaoyang and Wang, Cheng and Jin, Weifei and Wang, Songlin and Pan, Shengli and Tao, Xiaofeng , year =. 2406.18379 , archivePrefix =

  40. [40]

    Exploring

    Al-Karaki, Jamal and Khan, Muhammad Al-Zafar and Omar, Marwan , year =. Exploring. 2409.07587 , archivePrefix =