
arxiv: 2605.16455 · v1 · pith:4WYABC5Enew · submitted 2026-05-15 · 💻 cs.CR

MalwarePT: A Binary-Level Foundation Model for Malware Analysis

Pith reviewed 2026-05-20 18:18 UTC · model grok-4.3

classification 💻 cs.CR
keywords malware analysis · binary foundation models · BPE tokenization · Windows PE files · API call prediction · functionality classification · malware detection · temporal drift

The pith

Pretraining a binary encoder on Windows PE code bytes with BPE tokenization transfers to API prediction, functionality classification, and low-FPR malware detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MalwarePT as a single pretrained encoder that learns from Windows PE code-section bytes. It first trains a BPE tokenizer to capture frequent multi-byte patterns, then pretrains the encoder with masked language modeling on the tokenized bytes. The setup is tested on token-level API call prediction, function-level functionality classification, and document-level malware detection under temporal drift. Pretraining yields substantial gains on API call prediction and functionality classification, a BPE vocabulary of 1,024 tokens gives the strongest overall tradeoff, and in detection at FPR ≈ 0.001 the model beats the neural baselines while complementing feature-engineering models built on PE structure.
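To make the tokenization step concrete, here is a minimal sketch of byte-pair encoding over raw bytes. It assumes a toy in-memory corpus; the function names (`train_byte_bpe`, `merge_pair`, `encode`) and the greedy most-frequent-pair rule are illustrative assumptions, not the authors' tokenizer. The paper states only that a BPE tokenizer is trained on code-section bytes.

```python
from collections import Counter


def merge_pair(seq, a, b, new_id):
    """Replace every non-overlapping occurrence of (a, b) in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_bpe(corpus, vocab_size):
    """Learn BPE merges over byte sequences.

    corpus: list of bytes objects (stand-ins for PE code sections).
    vocab_size: target size, including the 256 base byte symbols.
    Returns a list of merges; merge i creates token id 256 + i.
    """
    seqs = [list(blob) for blob in corpus]
    merges = []
    while 256 + len(merges) < vocab_size:
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left that repeats
            break
        new_id = 256 + len(merges)
        merges.append((a, b))
        seqs = [merge_pair(seq, a, b, new_id) for seq in seqs]
    return merges


def encode(data, merges):
    """Tokenize bytes by replaying the learned merges in order."""
    seq = list(data)
    for i, (a, b) in enumerate(merges):
        seq = merge_pair(seq, a, b, 256 + i)
    return seq
```

With a vocabulary of 256 no merges are learned and the encoder sees one token per byte, which corresponds to the byte-level baseline. Larger vocabularies shorten the token sequence, so more code fits into the same fixed context length.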

Core claim

A ModernBERT-style encoder pretrained with masked language modeling on BPE-tokenized Windows PE code-section bytes yields reusable representations that improve performance across API call prediction, functionality classification, and malware detection tasks, with the largest advantages appearing at low false-positive rates and when combined with PE-structure models.

What carries the argument

The MalwarePT encoder, a transformer pretrained via masked language modeling on BPE-compressed PE code bytes, which learns multi-byte patterns to produce task-transferable binary representations.
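The masked language modeling objective can be sketched as a corruption step applied to each token sequence before it reaches the encoder. The example below is a simplified illustration: `MASK_ID`, the `IGNORE` label value, and the 15% masking rate are assumptions chosen for the sketch, since this page does not report the paper's masking probability or special-token layout.

```python
import random

MASK_ID = 1024  # hypothetical id reserved for the mask token
IGNORE = -100   # label value meaning "this position is not scored"


def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for one masked language modeling example.

    Each position is masked independently with probability mask_prob.
    Masked positions keep their original token as the label; all other
    positions receive IGNORE so a loss function can skip them.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels
```

The encoder is then trained to recover the original token at each masked position from the surrounding context on both sides.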

If this is right

  • API call sequences become more accurately predictable from raw binary bytes after pretraining.
  • Functionality labels for code segments improve when the model starts from the pretrained weights.
  • Malware detection reaches higher true-positive rates at false-positive rates near 0.001 than other neural baselines.
  • The learned representations add information not captured by handcrafted PE structure features.
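The low-FPR detection claim depends on how the operating point is chosen. A minimal sketch of computing the true-positive rate at a fixed false-positive budget follows; the function name `tpr_at_fpr` and the strict `score > threshold` decision rule are assumptions for illustration, not the paper's evaluation code.

```python
def tpr_at_fpr(scores, labels, target_fpr=0.001):
    """True-positive rate at the tightest threshold meeting target_fpr.

    scores: higher means more likely malicious.
    labels: 1 for malware, 0 for benign.
    A sample is flagged when score > threshold.
    """
    benign = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    malware = [s for s, y in zip(scores, labels) if y == 1]
    if not benign or not malware:
        raise ValueError("need at least one benign and one malicious sample")
    # Number of benign samples that may be flagged without exceeding the budget.
    allowed = int(target_fpr * len(benign))
    threshold = benign[allowed] if allowed < len(benign) else float("-inf")
    tpr = sum(s > threshold for s in malware) / len(malware)
    return tpr, threshold
```

At a target of 0.001, fewer than 1,000 benign samples leave no room for even one false positive, so the threshold falls back to the highest benign score. Results at this operating point are therefore only informative when the benign test set is large.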

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A single binary foundation model could replace several task-specific feature pipelines in malware analysis workflows.
  • Tuning BPE vocabulary size offers a practical lever for trading context length against pattern capture in other executable domains.
  • The observed complementarity with structure features points to hybrid systems that combine pretrained byte patterns with static metadata.

Load-bearing premise

Code-section bytes drawn from a broad set of Windows PE files form a representative distribution that transfers to later malware tasks without major shifts in the evaluation data.

What would settle it

Running the same downstream tasks on a fresh temporal split of malware samples collected well after the pretraining cutoff and finding no gains over non-pretrained neural baselines would falsify the transfer benefit.
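A fresh temporal split of this kind can be sketched as follows. The record layout (`sample_id`, `first_seen`), the function name `temporal_split`, and the optional gap between the two windows are assumptions for illustration; the paper's actual collection dates are not given on this page.

```python
from datetime import date, timedelta


def temporal_split(samples, cutoff, gap_days=0):
    """Split samples into a pretraining pool and a later evaluation pool.

    samples: iterable of (sample_id, first_seen) with first_seen a date.
    cutoff: last date allowed into the pretraining pool.
    gap_days: optional buffer after the cutoff; samples first seen inside
        the buffer are dropped from both pools.
    """
    eval_start = cutoff + timedelta(days=gap_days)
    pretrain = [sid for sid, seen in samples if seen <= cutoff]
    evaluation = [sid for sid, seen in samples if seen > eval_start]
    return pretrain, evaluation
```

A larger `gap_days` makes the test harsher, because the evaluation samples sit further from anything the encoder or tokenizer could have seen.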

Figures

Figures reproduced from arXiv: 2605.16455 by Christopher Kruegel, Giovanni Vigna, Hojjat Aghakhani, Kaie Chen, Roman Vasilenko, Saastha Vasan, Wenbo Guo, Yigitcan Kaya, Yuzhou Nie.

Figure 1: Overview of MalwarePT. Raw bytes are extracted from the code section of each PE file, converted into atomic byte-level symbols, tokenized with a BPE vocabulary, and split into fixed-length sequences. These sequences are used to pretrain a ModernBERT-inspired bidirectional encoder with masked language modeling, after which the pretrained encoder is fine-tuned end-to-end with task-specific heads for downstre…
Figure 2: Attention patterns in byte-level binary encoders. BERT-style encoders …
Figure 3: Left: Dataset filtering across six stages. Right: Log scale showing original executable file size (red solid), code segments size (blue dashed), and BPE tokens (green dotted). BPE tokens reduce the original binary by approximately 79%. AVClass2 failed to identify families for 7.04% of the samples. Among the remaining 92.96%, we identified 1,992 distinct malware families, representing a broad spectrum of r…
Original abstract

Automated malware analysis increasingly relies on machine learning, yet most existing methods remain task-specific and depend on handcrafted features or narrowly scoped models. Recent developments in binary-level foundation models suggest a path toward reusable program representations, but their application to malware analysis remains underexplored, and most still operate at byte-level tokenization, limiting their ability to capture multi-byte code patterns. In this work, we introduce MalwarePT, a binary-level foundation model for malware analysis built on a ModernBERT-style encoder and pretrained with masked language modeling on Windows PE code-section bytes. We study whether a single pretrained encoder can transfer across malware-analysis tasks at different granularities, and how tokenization design affects that transfer. We train a byte-pair encoding (BPE) tokenizer on code-section bytes to compress frequent multi-byte patterns within a fixed context budget. We evaluate MalwarePT on three downstream tasks spanning token-, function-, and document-level prediction: API call prediction, functionality classification, and malware (program) detection under temporal drift. Our evaluation demonstrates that pretraining yields substantial gains for API call prediction and functionality classification, and that increasing the BPE vocabulary beyond the byte-level baseline improves performance, with the strongest overall tradeoff at a vocabulary size of 1,024 tokens. In malware detection at FPR ~ 0.001, MalwarePT outperforms the neural network baselines, and is complementary to feature-engineering models that rely on PE structure. We also compare against existing binary foundation models and show that MalwarePT's design choices yield gains across all downstream tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces MalwarePT, a ModernBERT-style encoder pretrained via masked language modeling on Windows PE code-section bytes using BPE tokenization. It evaluates whether the resulting representations transfer to token-level (API call prediction), function-level (functionality classification), and document-level (malware detection under temporal drift) tasks, reporting consistent gains from pretraining, benefits from increasing BPE vocabulary size up to 1024, outperformance of neural baselines at low FPR in detection, and complementarity to PE-structure feature models.

Significance. If the results hold, the work indicates that BPE-based binary pretraining can produce reusable representations that improve performance across granularities in malware analysis and remain useful alongside traditional features. The temporal-drift evaluation and explicit ablations against byte-level baselines and prior binary models are positive elements that strengthen applicability claims.

major comments (1)
  1. [Evaluation under temporal drift] Temporal drift evaluation (abstract and evaluation section): the manuscript states that malware detection is assessed 'under temporal drift' yet provides no explicit confirmation that the pretraining corpus collection window precedes the evaluation cutoff, that no future-period samples entered pretraining, or that the BPE vocabulary was learned exclusively on pre-drift data. This verification is load-bearing for interpreting reported transfer gains as arising from the MLM objective rather than from reduced distribution shift between pretraining and test regimes.
minor comments (1)
  1. [Results] Notation for BPE vocabulary sizes and model variants could be standardized across tables and figures to ease comparison of the 1024-token tradeoff.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying a point that requires greater clarity in the temporal-drift evaluation. We respond to the major comment below and will revise the manuscript to address it.

  1. Referee: [Evaluation under temporal drift] Temporal drift evaluation (abstract and evaluation section): the manuscript states that malware detection is assessed 'under temporal drift' yet provides no explicit confirmation that the pretraining corpus collection window precedes the evaluation cutoff, that no future-period samples entered pretraining, or that the BPE vocabulary was learned exclusively on pre-drift data. This verification is load-bearing for interpreting reported transfer gains as arising from the MLM objective rather than from reduced distribution shift between pretraining and test regimes.

    Authors: We agree that explicit documentation of the data-collection timelines is necessary to substantiate the temporal-drift claim. The pretraining corpus and BPE vocabulary were constructed exclusively from samples whose collection window ends before the start of the evaluation dataset used for malware detection; no future-period samples were included in either. To make this verification transparent, we will add a short subsection (or expanded paragraph) in the Evaluation section that states the relevant collection cutoffs, confirms the absence of overlap, and notes that the BPE tokenizer was fit only on the pretraining corpus. This revision will allow readers to confirm that reported gains arise from the MLM objective rather than from inadvertent distribution alignment. revision: yes
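The overlap check promised in this response could take roughly the following form. This is a sketch under stated assumptions: the helper names (`sha256_hex`, `audit_leakage`) and the `(file_bytes, first_seen)` record layout are hypothetical, and exact-hash matching will not catch near-duplicate or repacked samples.

```python
import hashlib
from datetime import date


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest used as a file identity."""
    return hashlib.sha256(data).hexdigest()


def audit_leakage(pretrain, evaluation):
    """Check that two corpora share no files and no collection time.

    Each argument is a non-empty list of (file_bytes, first_seen) pairs,
    with first_seen a date.
    """
    pretrain_hashes = {sha256_hex(blob) for blob, _ in pretrain}
    eval_hashes = {sha256_hex(blob) for blob, _ in evaluation}
    latest_pretrain = max(seen for _, seen in pretrain)
    earliest_eval = min(seen for _, seen in evaluation)
    return {
        "shared_hashes": sorted(pretrain_hashes & eval_hashes),
        "windows_disjoint": latest_pretrain < earliest_eval,
    }
```

Because the BPE vocabulary is itself learned from data, the same audit would need to be run against the tokenizer's training corpus as well as the encoder's.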

Circularity Check

0 steps flagged

No circularity: empirical results from distinct pretraining and held-out evaluation stages

full rationale

The paper reports an empirical machine-learning study: a ModernBERT-style encoder is pretrained once via masked language modeling on a collection of Windows PE code-section bytes (with BPE tokenization), then fine-tuned end-to-end with task-specific heads on three separate downstream tasks (API call prediction, functionality classification, malware detection under temporal drift). Performance gains are measured as accuracy/F1/AUC numbers on held-out test splits that are not used in pretraining or vocabulary fitting. No equations, first-principles derivations, or fitted-parameter predictions appear in the abstract or described methodology that reduce the reported gains to quantities defined by the same inputs. As described, the evaluation separates the pretraining corpus from the downstream test distributions, satisfying the self-contained benchmark criterion; the referee report asks that the collection windows behind this separation be stated explicitly. No load-bearing self-citations, ansatzes, or renamings of known results are invoked to justify the central claims.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard transformer pretraining assumptions plus domain-specific choices about what constitutes representative malware code bytes. No new physical entities are postulated.

free parameters (2)
  • BPE vocabulary size
    Chosen after experimentation; the paper identifies 1024 as the strongest tradeoff point.
  • Model architecture hyperparameters
    ModernBERT-style encoder size, learning rate schedule, and masking probability are fitted or selected during pretraining.
axioms (2)
  • domain assumption: Code-section bytes from Windows PE files form a suitable pretraining corpus for learning transferable representations for malware analysis.
    Invoked when the authors restrict pretraining to code sections and assume transfer to downstream tasks.
  • domain assumption: Masked language modeling on byte sequences captures useful multi-byte code patterns for malware tasks.
    Central justification for the pretraining objective.



Statement on Data Availability

We are committed to maximizing the reproducibility of our work, and upon publication we will release all source code and pretrained model weights for every version of MalwarePT, including the 256–4096...