MalwarePT: A Binary-Level Foundation Model for Malware Analysis
Pith reviewed 2026-05-20 18:18 UTC · model grok-4.3
The pith
Pretraining a binary encoder on Windows PE code bytes with BPE tokenization transfers to API prediction, functionality classification, and low-FPR malware detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A ModernBERT-style encoder pretrained with masked language modeling on BPE-tokenized Windows PE code-section bytes yields reusable representations that improve performance across API call prediction, functionality classification, and malware detection tasks, with the largest advantages appearing at low false-positive rates and when combined with PE-structure models.
What carries the argument
The MalwarePT encoder, a transformer pretrained via masked language modeling on BPE-compressed PE code bytes, which learns multi-byte patterns to produce task-transferable binary representations.
If this is right
- API call sequences become more accurately predictable from raw binary bytes after pretraining.
- Functionality labels for code segments improve when the model starts from the pretrained weights.
- Malware detection reaches higher true-positive rates at false-positive rates near 0.001 than other neural baselines.
- The learned representations add information not captured by handcrafted PE structure features.
Where Pith is reading between the lines
- A single binary foundation model could replace several task-specific feature pipelines in malware pipelines.
- Tuning BPE vocabulary size offers a practical lever for trading context length against pattern capture in other executable domains.
- The observed complementarity with structure features points to hybrid systems that combine pretrained byte patterns with static metadata.
Load-bearing premise
Code-section bytes drawn from a broad set of Windows PE files form a representative distribution that transfers to later malware tasks without major shifts in the evaluation data.
What would settle it
Running the same downstream tasks on a fresh temporal split of malware samples collected well after the pretraining cutoff and finding no gains over non-pretrained neural baselines would falsify the transfer benefit.
Figures
read the original abstract
Automated malware analysis increasingly relies on machine learning, yet most existing methods remain task-specific and depend on handcrafted features or narrowly scoped models. Recent developments in binary-level foundation models suggest a path toward reusable program representations, but their application to malware analysis remains underexplored, and most still operate at byte-level tokenization, limiting their ability to capture multi-byte code patterns. In this work, we introduce MalwarePT, a binary-level foundation model for malware analysis built on a ModernBERT-style encoder and pretrained with masked language modeling on Windows PE code-section bytes. We study whether a single pretrained encoder can transfer across malware-analysis tasks at different granularities, and how tokenization design affects that transfer. We train a byte-pair encoding (BPE) tokenizer on code-section bytes to compress frequent multi-byte patterns within a fixed context budget. We evaluate MalwarePT on three downstream tasks spanning token-, function-, and document-level prediction: API call prediction, functionality classification, and malware (program) detection under temporal drift. Our evaluation demonstrates that pretraining yields substantial gains for API call prediction and functionality classification, and that increasing the BPE vocabulary beyond the byte-level baseline improves performance, with the strongest overall tradeoff at a vocabulary size of 1,024 tokens. In malware detection at FPR ~ 0.001, MalwarePT outperforms the neural network baselines, and is complementary to feature-engineering models that rely on PE structure. We also compare against existing binary foundation models and show that MalwarePT's design choices yield gains across all downstream tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MalwarePT, a ModernBERT-style encoder pretrained via masked language modeling on Windows PE code-section bytes using BPE tokenization. It evaluates whether the resulting representations transfer to token-level (API call prediction), function-level (functionality classification), and document-level (malware detection under temporal drift) tasks, reporting consistent gains from pretraining, benefits from increasing BPE vocabulary size up to 1024, outperformance of neural baselines at low FPR in detection, and complementarity to PE-structure feature models.
Significance. If the results hold, the work indicates that BPE-based binary pretraining can produce reusable representations that improve performance across granularities in malware analysis and remain useful alongside traditional features. The temporal-drift evaluation and explicit ablations against byte-level baselines and prior binary models are positive elements that strengthen applicability claims.
major comments (1)
- [Evaluation under temporal drift] Temporal drift evaluation (abstract and evaluation section): the manuscript states that malware detection is assessed 'under temporal drift' yet provides no explicit confirmation that the pretraining corpus collection window precedes the evaluation cutoff, that no future-period samples entered pretraining, or that the BPE vocabulary was learned exclusively on pre-drift data. This verification is load-bearing for interpreting reported transfer gains as arising from the MLM objective rather than from reduced distribution shift between pretraining and test regimes.
minor comments (1)
- [Results] Notation for BPE vocabulary sizes and model variants could be standardized across tables and figures to ease comparison of the 1024-token tradeoff.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying a point that requires greater clarity in the temporal-drift evaluation. We respond to the major comment below and will revise the manuscript to address it.
read point-by-point responses
-
Referee: [Evaluation under temporal drift] Temporal drift evaluation (abstract and evaluation section): the manuscript states that malware detection is assessed 'under temporal drift' yet provides no explicit confirmation that the pretraining corpus collection window precedes the evaluation cutoff, that no future-period samples entered pretraining, or that the BPE vocabulary was learned exclusively on pre-drift data. This verification is load-bearing for interpreting reported transfer gains as arising from the MLM objective rather than from reduced distribution shift between pretraining and test regimes.
Authors: We agree that explicit documentation of the data-collection timelines is necessary to substantiate the temporal-drift claim. The pretraining corpus and BPE vocabulary were constructed exclusively from samples whose collection window ends before the start of the evaluation dataset used for malware detection; no future-period samples were included in either. To make this verification transparent, we will add a short subsection (or expanded paragraph) in the Evaluation section that states the relevant collection cutoffs, confirms the absence of overlap, and notes that the BPE tokenizer was fit only on the pretraining corpus. This revision will allow readers to confirm that reported gains arise from the MLM objective rather than from inadvertent distribution alignment. revision: yes
Circularity Check
No circularity: empirical results from distinct pretraining and held-out evaluation stages
full rationale
The paper reports an empirical machine-learning study: a ModernBERT encoder is pretrained once via masked language modeling on a collection of Windows PE code-section bytes (with BPE tokenization), then frozen and transferred to three separate downstream tasks (API call prediction, functionality classification, malware detection under temporal drift). Performance gains are measured as accuracy/F1/AUC numbers on held-out test splits that are not used in pretraining or vocabulary fitting. No equations, first-principles derivations, or fitted-parameter predictions appear in the abstract or described methodology that reduce the reported gains to quantities defined by the same inputs. The evaluation explicitly separates pretraining corpus from downstream test distributions, satisfying the self-contained benchmark criterion. No load-bearing self-citations, ansatzes, or renamings of known results are invoked to justify the central claims.
Axiom & Free-Parameter Ledger
free parameters (2)
- BPE vocabulary size
- Model architecture hyperparameters
axioms (2)
- domain assumption Code-section bytes from Windows PE files form a suitable pretraining corpus for learning transferable representations for malware analysis.
- domain assumption Masked language modeling on byte sequences captures useful multi-byte code patterns for malware tasks.
Reference graph
Works this paper leans on
-
[1]
Aghakhani, H., Gritti, F., Mecca, F., Lindorfer, M., Ortolani, S., Balzarotti, D., Vigna, G., Kruegel, C.: When malware is packin’heat; limits of machine learning classifiers based on static analysis features. In: NDSS (2020)
work page 2020
-
[2]
In: Pro- ceedings of the sixth ACM conference on data and application security and privacy (2016)
Ahmadi, M., Ulyanov, D., Semenov, S., Trofimov, M., Giacinto, G.: Novel feature extraction, selection and fusion for effective malware family classification. In: Pro- ceedings of the sixth ACM conference on data and application security and privacy (2016)
work page 2016
-
[3]
Ahn, S., Ahn, S., Koo, H., Paek, Y.: Practical binary code similarity detection with bert-based transferable similarity learning. In: ACSAC (2022)
work page 2022
-
[4]
EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models
Anderson, H.S., Roth, P.: Ember: an open dataset for training static pe malware machine learning models. arXiv preprint arXiv:1804.04637 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[5]
In: 31st USENIX Security Symposium (USENIX Security 22)
Arp, D., Quiring, E., Pendlebury, F., Warnecke, A., Pierazzi, F., Wressnegger, C., Cavallaro, L., Rieck, K.: Dos and don’ts of machine learning in computer secu- rity. In: 31st USENIX Security Symposium (USENIX Security 22). pp. 3971–3988 (2022)
work page 2022
-
[6]
In: 2022 IEEE Symposium on Security and Privacy (SP)
Barbero, F., Pendlebury, F., Pierazzi, F., Cavallaro, L.: Transcending transcend: Revisiting malware classification in the presence of concept drift. In: 2022 IEEE Symposium on Security and Privacy (SP). pp. 805–823. IEEE (2022)
work page 2022
-
[7]
Longformer: The Long-Document Transformer
Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150 (2020)
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[8]
In: 2025 IEEE Symposium on Security and Privacy (SP)
Benkraouda, H., Diwan, N., Wang, G.: You can’t judge a binary by its header: Data-code separation for non-standard arm binaries using pseudo labels. In: 2025 IEEE Symposium on Security and Privacy (SP). pp. 36–36. IEEE Computer So- ciety (2024)
work page 2025
-
[9]
Transfer Learning for Image-Based Malware Classification
Bhodia, N., Prajapati, P., Di Troia, F., Stamp, M.: Transfer learning for image- based malware classification. arXiv preprint arXiv:1903.11551 (2019)
work page internal anchor Pith review Pith/arXiv arXiv 1903
-
[10]
Contributors, P.: Pytorch: An open source machine learning framework (2024), https://pytorch.org/
work page 2024
-
[11]
Dambra, S., Han, Y., Aonzo, S., Kotzias, P., Vitale, A., Caballero, J., Balzarotti, D., Bilge, L.: Decoding the secrets of machine learning in malware classification: A deep dive into datasets, feature extraction, and model performance. In: CCS (2023)
work page 2023
-
[12]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin,J.,Chang,M.W.,Lee,K.,Toutanova,K.:Bert:Pre-trainingofdeepbidirec- tional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[13]
Fuyong, Z., Tiezhu, Z.: Malware detection and classification based on n-grams attribute similarity. In: CSE. IEEE (2017)
work page 2017
-
[14]
Hex-Rays: Ida pro.https://hex-rays.com/ida-pro
-
[15]
Horsicq: Detect it easy (2024),https://github.com/horsicq/Detect-It-Easy
work page 2024
- [16]
-
[17]
In: Proceedings of the 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML)
Kaya, Y., Chen, Y., Botacin, M., Saha, S., Pierazzi, F., Cavallaro, L., Wagner, D., Dumitras, T.: Ml-based behavioral malware detection is far from a solved problem. In: Proceedings of the 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). IEEE (2025)
work page 2025
-
[18]
In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security
Kim, D., Kwon, B.J., Dumitraş, T.: Certified malware: Measuring breaches of trust in the windows code-signing pki. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. pp. 1435–1448 (2017) 22
work page 2017
-
[19]
In: 27th USENIX Security Symposium (USENIX Security 18)
Kim, D., Kwon, B.J., Kozák, K., Gates, C., Dumitras,, T.: The broken shield: Measuring revocation effectiveness in the windows{Code-Signing}{PKI}. In: 27th USENIX Security Symposium (USENIX Security 18). pp. 851–868 (2018)
work page 2018
- [20]
-
[21]
Digital investigation3, 91–97 (2006)
Kornblum, J.: Identifying almost identical files using context triggered piecewise hashing. Digital investigation3, 91–97 (2006)
work page 2006
-
[22]
Kruczkowski, M., Szynkiewicz, E.N.: Support vector machine for malware analysis and classification. In: WI and IAT. IEEE (2014)
work page 2014
-
[23]
Kumar, R., Xiaosong, Z., Khan, R.U., Ahad, I., Kumar, J.: Malicious code detec- tion based on image processing using deep learning. In: ICCAI (2018)
work page 2018
-
[24]
In: The Network and Distributed System Security (NDSS) Symposium (2026)
Kurlandski, L., Berger, H., Pan, Y., Wright, M.: Beyond raw bytes: Towards large malware language models. In: The Network and Distributed System Security (NDSS) Symposium (2026)
work page 2026
-
[25]
In: Ieee infocom 2022-ieee conference on computer communications
Ling, X., Wu, L., Deng, W., Qu, Z., Zhang, J., Zhang, S., Ma, T., Wang, B., Wu, C., Ji, S.: Malgraph: Hierarchical graph neural networks for robust windows malware detection. In: Ieee infocom 2022-ieee conference on computer communications. pp. 1998–2007. IEEE (2022)
work page 2022
-
[26]
Advances in Neural Information Processing Systems37, 58698–58715 (2024)
Liu, C., Saul, R., Sun, Y., Raff, E., Fuchs, M., Southard Pantano, T., Holt, J., Micinski, K.: Assemblage: Automatic binary dataset construction for machine learning. Advances in Neural Information Processing Systems37, 58698–58715 (2024)
work page 2024
-
[27]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[28]
Makandar, A., Patrot, A.: Malware class recognition using image processing tech- niques. In: ICDMAI. IEEE (2017)
work page 2017
-
[29]
MITRE ATT&CK: Mitre att&ck framework.https://attack.mitre.org
-
[30]
Pascanu, R., Stokes, J.W., Sanossian, H., Marinescu, M., Thomas, A.: Malware classification with recurrent networks. In: ICASSP. IEEE (2015)
work page 2015
-
[31]
arXiv preprint arXiv:2010.00770 (2020)
Pei, K., Guan, J., Williams-King, D., Yang, J., Jana, S.: Xda: Accurate, robust disassembly with transfer learning. arXiv preprint arXiv:2010.00770 (2020)
-
[32]
Pontello, M.: Trid - file identifier (2024),http://mark0.net/soft-trid-e.html
work page 2024
-
[33]
Radford,A.,Wu,J.,Child,R.,Luan,D.,Amodei,D.,Sutskever,I.,etal.:Language models are unsupervised multitask learners. OpenAI blog (8) (2019)
work page 2019
-
[34]
Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., Nicholas, C.K.: Malware detection by eating a whole exe. In: AAAI Workshop (2018)
work page 2018
- [35]
-
[36]
Rathore,H.,Agarwal,S.,Sahay,S.K.,Sewak,M.:Malwaredetectionusingmachine learning and deep learning. In: Big Data Analytics. Springer (2018)
work page 2018
-
[37]
In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y
Saha, S., Wang, W., Kaya, Y., Feizi, S., Dumitras, T.: Drsm: De-randomized smoothing on malware classifier providing certified robustness. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Con- ference on Learning Representations. vol. 2024, pp. 47666–47686 (2024)
work page 2024
-
[38]
Sajid, M.S.I., Wei, J., Abdeen, B., Al-Shaer, E., Islam, M.M., Diong, W., Khan, L.: Soda: A system for cyber deception orchestration and automation. In: ACSAC (2021) 23
work page 2021
-
[39]
Santos, I., Brezo, F., Ugarte-Pedrero, X., Bringas, P.G.: Opcode sequences as rep- resentation of executables for data-mining-based unknown malware detection. in- formation Sciences (2013)
work page 2013
-
[40]
Saxe, J., Berlin, K.: Deep neural network based malware detection using two di- mensional binary program features. In: MALWARE. IEEE (2015)
work page 2015
-
[41]
Sebastián, S., Caballero, J.: Avclass2: Massive malware tag extraction from av labels. In: ACSAC (2020)
work page 2020
- [42]
-
[43]
Standard Performance Evaluation Corporation: Spec cpu®2006 benchmark.ht tps://www.spec.org/cpu2006/(2006), accessed: 2025-11-20
work page 2006
-
[44]
Standard Performance Evaluation Corporation: Spec cpu®2017 benchmark.ht tps://www.spec.org/cpu2017/(2017), accessed: 2025-11-20
work page 2017
-
[45]
Neurocomputing568, 127063 (2024)
Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., Liu, Y.: Roformer: Enhanced trans- former with rotary position embedding. Neurocomputing568, 127063 (2024)
work page 2024
-
[46]
Team, T.F.: capa: The flare team’s open-source tool to identify capabilities in executable files (2024),https://github.com/mandiant/capa
work page 2024
-
[47]
Vasan, D., Alazab, M., Wassan, S., Naeem, H., Safaei, B., Zheng, Q.: Imcfn: Image- based malware classification using fine-tuned convolutional neural network archi- tecture. Computer Networks (2020)
work page 2020
- [48]
-
[49]
VirusTotal: Virustotal - free online virus, malware and url scanner.https://www. virustotal.com/
-
[50]
Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., et al.: Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In: Proceedings of the 63rd Annual Meeting of the As- sociation for Computational ...
work page 2025
-
[51]
arXiv preprint arXiv:1708.08042 (2017)
Yue,S.,Wang,T.:Imbalancedmalwareimagesclassification:acnnbasedapproach. arXiv preprint arXiv:1708.08042 (2017)
-
[52]
Zhang, B., Sennrich, R.: Root mean square layer normalization. Advances in Neural Information Processing Systems32(2019) 24 8 Statement on Data Availability We are committed to maximizing the reproducibility of our work, and upon publication we will release all source code and pretrained model weights for every version ofMal w arePT—including the 256–4096...
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.