MalGEN: A Testbed for Modeling and Evaluating Malware Behaviors
Pith reviewed 2026-05-19 10:59 UTC · model grok-4.3
The pith
MalGEN decomposes high-level attacks into stages to generate malware samples, 45.71 percent of which go undetected by existing detection engines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MalGEN is a modular testbed that models adversarial workflows by decomposing high-level attack objectives into structured stages, enabling the controlled synthesis of executable artifacts. Evaluation across 1,920 benchmark settings produces 977 samples exhibiting diverse malicious techniques and multi-stage patterns. Of these, 45.71 percent remain undetected by existing engines, which the authors read as revealing notable gaps in current defenses.
What carries the argument
MalGEN, the modular testbed that decomposes high-level attack objectives into structured stages for synthesizing diverse executable malware artifacts in a controlled environment.
If this is right
- Existing malware repositories can be augmented with generated samples to stress-test defenses against previously unseen multi-stage behaviors.
- Detection engines exhibit measurable gaps when confronted with the synthesized attack patterns.
- Security evaluation and testing practices can incorporate stage-based generation to build more robust systems.
- Insights from the undetected samples can inform targeted improvements in behavioral and signature-based detection methods.
Where Pith is reading between the lines
- The same staged decomposition approach could be adapted to generate test artifacts for related threats such as ransomware or command-and-control evasion.
- Periodic re-runs of the benchmark settings would allow ongoing measurement of how quickly new detectors close the observed gaps.
- Combining the framework with automated behavioral logging might yield larger-scale datasets for training improved machine-learning detectors.
Load-bearing premise
Decomposing high-level attack objectives into structured stages produces malware behaviors representative of real adversarial workflows.
What would settle it
Comparing the runtime behaviors and detection rates of the generated samples against those of confirmed real-world malware from documented incidents in isolated execution environments.
Abstract
Modern cybersecurity requires systematic ways to evaluate how detection systems respond to evolving and previously unseen attack behaviors. Existing malware repositories largely capture known patterns and provide limited support for stress-testing defenses against novel threats. To address this, we present MalGEN, a modular testbed that models adversarial workflows and generates executable artifacts in a controlled environment. The framework decomposes high-level attack objectives into structured stages, enabling the synthesis of diverse and multi-stage behaviors. We evaluate MalGEN across 1,920 benchmark settings covering multiple platforms and behavioral objectives, resulting in 977 executable samples. Analysis shows that the generated artifacts exhibit a wide range of malicious techniques and multi-stage attack patterns. However, 45.71% of these samples remain undetected by existing detection engines, which reveals notable gaps in current defenses. These findings provide practical insights into the limitations of widely used detection approaches and support the development of more robust security evaluation and testing practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MalGEN, a modular testbed that decomposes high-level attack objectives into structured stages to generate executable malware artifacts in a controlled environment. It evaluates the framework across 1,920 benchmark settings covering multiple platforms and behavioral objectives, producing 977 executable samples. The authors report that the generated artifacts exhibit diverse malicious techniques and multi-stage patterns, with 45.71% remaining undetected by existing detection engines, which they interpret as evidence of gaps in current defenses.
Significance. If the generated samples prove to be realistic proxies for real malware, MalGEN offers a practical contribution by enabling systematic, controlled stress-testing of detection systems against novel multi-stage behaviors. The modular decomposition approach could support reproducible evaluation practices and help identify limitations in widely used detectors. The work's value would be strengthened by explicit validation that the outputs reflect representative adversarial workflows rather than testbed-specific artifacts.
major comments (2)
- [Abstract] The headline result that 45.71% of the 977 samples remain undetected is presented as revealing 'notable gaps in current defenses.' However, the manuscript provides no details on the detection engines employed, the criteria or methodology for classifying a sample as detected (e.g., static/dynamic analysis, specific tools or AV products, scoring thresholds), or any controls for false negatives. This information is load-bearing for interpreting the evasion rate as a genuine indicator of defensive shortcomings rather than an artifact of the evaluation setup.
- [Framework and Evaluation] The central assumption that decomposing high-level objectives into structured stages accurately models real adversarial workflows and yields representative behaviors is not independently validated. The paper reports coverage across 1,920 settings and diverse techniques but includes no quantitative comparisons (e.g., behavioral similarity scores, API call sequence overlap, or expert labeling) against established real-world malware corpora. Without such evidence, the undetected fraction cannot be confidently attributed to limitations in existing detectors.
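The first major comment turns on how a sample gets counted as "detected." The sketch below shows one common way such a rule might be expressed: a sample counts as detected when at least `min_flags` engines flag it. The `ScanResult` record, the `min_flags` parameter, and the toy data are all assumptions made for illustration; the manuscript does not describe its actual criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanResult:
    """Hypothetical per-sample scan record; not a format from the paper."""
    sample_id: str
    engines_total: int    # engines that returned a verdict
    engines_flagged: int  # engines that labeled the sample malicious


def is_detected(result: ScanResult, min_flags: int = 1) -> bool:
    """Count a sample as detected when at least `min_flags` engines flag it."""
    return result.engines_flagged >= min_flags


def undetected_rate(results: list[ScanResult], min_flags: int = 1) -> float:
    """Fraction of samples that fall below the detection threshold."""
    if not results:
        raise ValueError("no scan results supplied")
    missed = sum(1 for r in results if not is_detected(r, min_flags))
    return missed / len(results)


# Toy data, invented for illustration only.
results = [
    ScanResult("a", 70, 0),
    ScanResult("b", 70, 2),
    ScanResult("c", 70, 15),
    ScanResult("d", 70, 0),
]

for k in (1, 3, 5):
    print(f"min_flags={k}: undetected rate = {undetected_rate(results, k):.2f}")
```

On this toy data the undetected rate moves from 0.50 to 0.75 when the threshold rises from one engine to three, which is why an unreported threshold makes a headline evasion figure hard to interpret.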
minor comments (2)
- [Abstract] Clarify the relationship between the 1,920 benchmark settings and the final count of 977 executable samples, including any filtering or success criteria applied during generation.
- [General] Consider adding a dedicated limitations or threats-to-validity subsection discussing potential discrepancies between synthesized and real malware implementation details.
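The first minor comment can be made concrete with a little arithmetic on the three figures quoted in the abstract. The variable names are illustrative, and the interpretation in the note that follows is an assumption rather than something the source states.

```python
settings = 1920          # benchmark settings, from the abstract
samples = 977            # executable samples, from the abstract
undetected_pct = 45.71   # reported undetected share, in percent

# Share of benchmark settings that ended in an executable sample.
generation_yield = samples / settings

# Number of samples the reported percentage would imply if taken over all 977.
implied_undetected = samples * undetected_pct / 100

print(f"yield: {generation_yield:.2%}")
print(f"settings without a sample: {settings - samples}")
print(f"implied undetected count: {implied_undetected:.1f}")
```

Roughly half of the settings (about 50.89%) yield a sample, leaving 943 settings unaccounted for. The implied undetected count, about 446.6, is not a whole number, and neither 446 nor 447 out of 977 rounds to 45.71%, so the percentage may be computed over a different denominator or averaged in some way; the source does not say.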
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We will revise the manuscript to provide greater transparency on the detection evaluation and to strengthen the justification for the modeling approach.
- Referee: [Abstract] The headline result that 45.71% of the 977 samples remain undetected is presented as revealing 'notable gaps in current defenses.' However, the manuscript provides no details on the detection engines employed, the criteria or methodology for classifying a sample as detected (e.g., static/dynamic analysis, specific tools or AV products, scoring thresholds), or any controls for false negatives. This information is load-bearing for interpreting the evasion rate as a genuine indicator of defensive shortcomings rather than an artifact of the evaluation setup.
Authors: We agree that the current description of the detection evaluation is insufficient. In the revised manuscript we will add a dedicated subsection under Evaluation that specifies the detection engines (including the exact AV products queried via VirusTotal and any dynamic sandboxes employed), the precise classification criteria (static signature matching combined with behavioral scoring), the decision thresholds, and the controls implemented to mitigate false negatives such as repeated scans across multiple engines and manual review of borderline cases. These additions will allow readers to evaluate the 45.71% evasion figure more rigorously.
Revision: yes
- Referee: [Framework and Evaluation] The central assumption that decomposing high-level objectives into structured stages accurately models real adversarial workflows and yields representative behaviors is not independently validated. The paper reports coverage across 1,920 settings and diverse techniques but includes no quantitative comparisons (e.g., behavioral similarity scores, API call sequence overlap, or expert labeling) against established real-world malware corpora. Without such evidence, the undetected fraction cannot be confidently attributed to limitations in existing detectors.
Authors: We recognize that explicit quantitative validation against real-world corpora would increase confidence in the representativeness of the generated behaviors. The stage decomposition draws directly from documented tactics in public threat intelligence and the MITRE ATT&CK framework; however, the original submission did not include similarity metrics because the testbed’s purpose is to synthesize novel combinations rather than replicate existing samples. In the revision we will add a new paragraph in the Framework section explaining this rationale, include a limited quantitative comparison (e.g., Jaccard similarity on API-call n-grams for a random subset of generated samples versus samples from a public malware repository), and insert an explicit limitations paragraph discussing the inherent difficulty of validating against “real” malware when the goal is to explore previously unseen behaviors. We view this as a partial but substantive response to the concern.
Revision: partial
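The Jaccard comparison on API-call n-grams mentioned in the second response can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the choice of bigrams (`n = 2`), and the two placeholder traces are hypothetical and do not come from the paper or from any real sample.

```python
def ngrams(calls: list[str], n: int = 2) -> set[tuple[str, ...]]:
    """Return the set of contiguous n-grams in an ordered call trace."""
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder traces with generic tokens, invented for illustration only.
generated = ["open", "read", "connect", "send", "close"]
reference = ["open", "read", "write", "connect", "send"]

score = jaccard(ngrams(generated), ngrams(reference))
print(f"bigram Jaccard similarity: {score:.3f}")
```

Here each trace has four bigrams, two of which are shared, so the score is 2/6. A set-based measure like this ignores how often an n-gram occurs, so it says something about shared behavior patterns but not about their frequency or ordering across the whole trace.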
Circularity Check
No circularity in framework description or empirical results
The paper describes MalGEN as a modular testbed that decomposes high-level attack objectives into structured stages to generate executable artifacts, then reports empirical coverage across 1,920 settings yielding 977 samples and a direct measurement that 45.71% evade detection engines. No equations, derivations, fitted parameters, or first-principles predictions appear in the provided text. The central claims are engineering descriptions and observational statistics rather than any result that reduces by construction to its own inputs. The work is self-contained as an empirical generation and evaluation framework with no load-bearing self-citations or ansatzes that would trigger the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: High-level attack objectives can be decomposed into structured stages that accurately represent real malware behaviors.
Forward citations
Cited by 4 Pith papers
- uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs
  uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.
- The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code
  A commercial LLM can cheaply produce large numbers of structurally diverse yet behaviorally equivalent malware payloads using functional prompts or history-augmented prompts.
- The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code
  A single commercial LLM can cheaply generate large populations of behaviorally equivalent yet structurally diverse malware payloads.
- LLM Harms: A Taxonomy and Discussion
  This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.