
arxiv: 2506.07586 · v2 · submitted 2025-06-09 · 💻 cs.CR

MalGEN: A Testbed for Modeling and Evaluating Malware Behaviors

Pith reviewed 2026-05-19 10:59 UTC · model grok-4.3

classification 💻 cs.CR
keywords: malware generation, testbed, detection evaluation, adversarial workflows, multi-stage attacks, cybersecurity, executable artifacts, evasion analysis

The pith

MalGEN decomposes high-level attack objectives into stages to generate malware samples, 45.71 percent of which go undetected by existing detection engines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MalGEN as a modular testbed for modeling adversarial workflows and creating executable malware artifacts to evaluate how detection systems handle novel threats. It breaks high-level attack objectives into structured stages to produce diverse multi-stage behaviors across platforms and objectives. Testing in 1,920 settings yields 977 samples that display a wide range of malicious techniques. Analysis finds that existing engines leave 45.71 percent of these samples undetected. The work thereby supplies concrete evidence of gaps in current defenses and a practical method for improving security testing.

Core claim

MalGEN is a modular testbed that models adversarial workflows by decomposing high-level attack objectives into structured stages, enabling the controlled synthesis of executable artifacts; evaluation across 1,920 benchmark settings produces 977 samples exhibiting diverse malicious techniques and multi-stage patterns, of which 45.71 percent remain undetected by existing engines and thereby reveal notable gaps in current defenses.
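As a quick arithmetic check on these figures, the following Python sketch computes the generation yield (samples per benchmark setting) and the sample count implied by the 45.71 percent figure. It assumes the percentage is taken over all 977 samples, which the summary does not state explicitly; the variable names are illustrative.

```python
# Figures quoted in the core claim.
SETTINGS = 1920          # benchmark settings evaluated
SAMPLES = 977            # executable samples produced
UNDETECTED_PCT = 45.71   # reported share of undetected samples

# Share of benchmark settings that yielded an executable sample.
yield_rate = SAMPLES / SETTINGS

# Sample count implied by the percentage, ASSUMING the denominator is 977.
implied_undetected = SAMPLES * UNDETECTED_PCT / 100

# Integer counts k for which k / 977 rounds to exactly 45.71 percent.
matching_counts = [
    k for k in range(SAMPLES + 1)
    if round(100 * k / SAMPLES, 2) == UNDETECTED_PCT
]

print(f"yield: {100 * yield_rate:.2f}%")
print(f"implied undetected count: {implied_undetected:.2f}")
print(f"integer counts matching 45.71%: {matching_counts}")
```

Under that assumption the implied count falls between 446 and 447, and neither integer rounds to 45.71 percent, so the figure may have been computed over a different denominator or with a different rounding convention. The text reviewed here does not say which.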

What carries the argument

MalGEN, the modular testbed that decomposes high-level attack objectives into structured stages for synthesizing diverse executable malware artifacts in a controlled environment.

If this is right

  • Existing malware repositories can be augmented with generated samples to stress-test defenses against previously unseen multi-stage behaviors.
  • Detection engines exhibit measurable gaps when confronted with the synthesized attack patterns.
  • Security evaluation and testing practices can incorporate stage-based generation to build more robust systems.
  • Insights from the undetected samples can inform targeted improvements in behavioral and signature-based detection methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged decomposition approach could be adapted to generate test artifacts for related threats such as ransomware or command-and-control evasion.
  • Periodic re-runs of the benchmark settings would allow ongoing measurement of how quickly new detectors close the observed gaps.
  • Combining the framework with automated behavioral logging might yield larger-scale datasets for training improved machine-learning detectors.

Load-bearing premise

Decomposing high-level attack objectives into structured stages produces malware behaviors representative of real adversarial workflows.

What would settle it

Comparing the runtime behaviors and detection rates of the generated samples against those of confirmed real-world malware from documented incidents in isolated execution environments.

Figures

Figures reproduced from arXiv: 2506.07586 by Bikash Saha, Sandeep Kumar Shukla.

Figure 1: MalGEN architecture showing modular agent interactions across the malware generation pipeline.
Figure 2: Family Distribution Overview: Number of samples with no label,

Original abstract

Modern cybersecurity requires systematic ways to evaluate how detection systems respond to evolving and previously unseen attack behaviors. Existing malware repositories largely capture known patterns and provide limited support for stress-testing defenses against novel threats. To address this, we present MalGEN, a modular testbed that models adversarial workflows and generates executable artifacts in a controlled environment. The framework decomposes high-level attack objectives into structured stages, enabling the synthesis of diverse and multi-stage behaviors. We evaluate MalGEN across 1,920 benchmark settings covering multiple platforms and behavioral objectives, resulting in 977 executable samples. Analysis shows that the generated artifacts exhibit a wide range of malicious techniques and multi-stage attack patterns. However, 45.71% of these samples remain undetected by existing detection engines, which reveals notable gaps in current defenses. These findings provide practical insights into the limitations of widely used detection approaches and support the development of more robust security evaluation and testing practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MalGEN, a modular testbed that decomposes high-level attack objectives into structured stages to generate executable malware artifacts in a controlled environment. It evaluates the framework across 1,920 benchmark settings covering multiple platforms and behavioral objectives, producing 977 executable samples. The authors report that the generated artifacts exhibit diverse malicious techniques and multi-stage patterns, with 45.71% remaining undetected by existing detection engines, which they interpret as evidence of gaps in current defenses.

Significance. If the generated samples prove to be realistic proxies for real malware, MalGEN offers a practical contribution by enabling systematic, controlled stress-testing of detection systems against novel multi-stage behaviors. The modular decomposition approach could support reproducible evaluation practices and help identify limitations in widely used detectors. The work's value would be strengthened by explicit validation that the outputs reflect representative adversarial workflows rather than testbed-specific artifacts.

major comments (2)
  1. [Abstract] Abstract: The headline result that 45.71% of the 977 samples remain undetected is presented as revealing 'notable gaps in current defenses.' However, the manuscript provides no details on the detection engines employed, the criteria or methodology for classifying a sample as detected (e.g., static/dynamic analysis, specific tools or AV products, scoring thresholds), or any controls for false negatives. This information is load-bearing for interpreting the evasion rate as a genuine indicator of defensive shortcomings rather than an artifact of the evaluation setup.
  2. [Framework and Evaluation] Framework and Evaluation sections: The central assumption that decomposing high-level objectives into structured stages accurately models real adversarial workflows and yields representative behaviors is not independently validated. The paper reports coverage across 1,920 settings and diverse techniques but includes no quantitative comparisons (e.g., behavioral similarity scores, API call sequence overlap, or expert labeling) against established real-world malware corpora. Without such evidence, the undetected fraction cannot be confidently attributed to limitations in existing detectors.
minor comments (2)
  1. [Abstract] Abstract: Clarify the relationship between the 1,920 benchmark settings and the final count of 977 executable samples, including any filtering or success criteria applied during generation.
  2. [General] General: Consider adding a dedicated limitations or threats-to-validity subsection discussing potential discrepancies between synthesized and real malware implementation details.
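
Major comment 1 turns on how "detected" is defined. The Python sketch below illustrates why: with per-engine verdicts held fixed, the evasion rate changes with the minimum number of engines that must flag a sample. The engine names, the toy verdicts, and the `min_engines` threshold are all hypothetical; the paper's actual criteria are not given in the text reviewed here.

```python
from typing import Mapping, Sequence

Verdicts = Mapping[str, bool]  # engine name -> True if the engine flagged the sample


def is_detected(verdicts: Verdicts, min_engines: int = 1) -> bool:
    """A sample counts as detected if at least `min_engines` engines flag it."""
    return sum(verdicts.values()) >= min_engines


def evasion_rate(samples: Sequence[Verdicts], min_engines: int = 1) -> float:
    """Fraction of samples that are NOT detected under the given threshold."""
    if not samples:
        raise ValueError("at least one sample is required")
    undetected = sum(1 for v in samples if not is_detected(v, min_engines))
    return undetected / len(samples)


# Hypothetical toy verdicts for four samples across three made-up engines.
toy_samples = [
    {"engine_a": True, "engine_b": False, "engine_c": False},
    {"engine_a": False, "engine_b": False, "engine_c": False},
    {"engine_a": True, "engine_b": True, "engine_c": False},
    {"engine_a": False, "engine_b": True, "engine_c": False},
]

for k in (1, 2, 3):
    print(f"min_engines={k}: evasion rate = {evasion_rate(toy_samples, k):.2f}")
```

On this toy data the evasion rate is 0.25, 0.75, and 1.00 for thresholds of one, two, and three engines, which is why a reported rate is hard to interpret without the classification rule.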

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We will revise the manuscript to provide greater transparency on the detection evaluation and to strengthen the justification for the modeling approach.

  1. Referee: [Abstract] Abstract: The headline result that 45.71% of the 977 samples remain undetected is presented as revealing 'notable gaps in current defenses.' However, the manuscript provides no details on the detection engines employed, the criteria or methodology for classifying a sample as detected (e.g., static/dynamic analysis, specific tools or AV products, scoring thresholds), or any controls for false negatives. This information is load-bearing for interpreting the evasion rate as a genuine indicator of defensive shortcomings rather than an artifact of the evaluation setup.

    Authors: We agree that the current description of the detection evaluation is insufficient. In the revised manuscript we will add a dedicated subsection under Evaluation that specifies the detection engines (including the exact AV products queried via VirusTotal and any dynamic sandboxes employed), the precise classification criteria (static signature matching combined with behavioral scoring), the decision thresholds, and the controls implemented to mitigate false negatives such as repeated scans across multiple engines and manual review of borderline cases. These additions will allow readers to evaluate the 45.71% evasion figure more rigorously.

    revision: yes

  2. Referee: [Framework and Evaluation] Framework and Evaluation sections: The central assumption that decomposing high-level objectives into structured stages accurately models real adversarial workflows and yields representative behaviors is not independently validated. The paper reports coverage across 1,920 settings and diverse techniques but includes no quantitative comparisons (e.g., behavioral similarity scores, API call sequence overlap, or expert labeling) against established real-world malware corpora. Without such evidence, the undetected fraction cannot be confidently attributed to limitations in existing detectors.

    Authors: We recognize that explicit quantitative validation against real-world corpora would increase confidence in the representativeness of the generated behaviors. The stage decomposition draws directly from documented tactics in public threat intelligence and the MITRE ATT&CK framework; however, the original submission did not include similarity metrics because the testbed’s purpose is to synthesize novel combinations rather than replicate existing samples. In the revision we will add a new paragraph in the Framework section explaining this rationale, include a limited quantitative comparison (e.g., Jaccard similarity on API-call n-grams for a random subset of generated samples versus samples from a public malware repository), and insert an explicit limitations paragraph discussing the inherent difficulty of validating against “real” malware when the goal is to explore previously unseen behaviors. We view this as a partial but substantive response to the concern.

    revision: partial
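
The comparison proposed in the second response can be sketched as follows. This is a minimal Python illustration of Jaccard similarity over API-call n-grams, not the authors' implementation; the traces use placeholder call names, and the n-gram length `n` is an assumed parameter.

```python
from typing import Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(calls: Sequence[str], n: int = 2) -> Set[NGram]:
    """Return the set of contiguous n-grams in an ordered call trace."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)}


def jaccard(a: Set[NGram], b: Set[NGram]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Placeholder traces: generic call names chosen only for illustration.
trace_generated = ["open", "read", "write", "close"]
trace_reference = ["open", "read", "write", "send"]

similarity = jaccard(ngrams(trace_generated), ngrams(trace_reference))
print(f"Jaccard similarity on 2-grams: {similarity:.2f}")
```

Here the two traces share two of four distinct 2-grams, giving a similarity of 0.50. A set-based measure like this ignores how often each n-gram occurs, so it captures overlap in behavior patterns rather than their frequency.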

Circularity Check

0 steps flagged

No circularity in framework description or empirical results


The paper describes MalGEN as a modular testbed that decomposes high-level attack objectives into structured stages to generate executable artifacts, then reports empirical coverage across 1,920 settings yielding 977 samples and a direct measurement that 45.71% evade detection engines. No equations, derivations, fitted parameters, or first-principles predictions appear in the provided text. The central claims are engineering descriptions and observational statistics rather than any result that reduces by construction to its own inputs. The work is self-contained as an empirical generation and evaluation framework with no load-bearing self-citations or ansatzes that would trigger the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim relies on the validity of the modeling approach and the representativeness of generated samples, with no explicit free parameters or new entities introduced.

axioms (1)
  • Domain assumption: High-level attack objectives can be decomposed into structured stages that accurately represent real malware behaviors.
    This is invoked in the description of the framework's decomposition process.



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. uGen: An Agentic Framework for Generating Microarchitectural Attack PoCs

    cs.CR 2026-05 unverdicted novelty 6.0

    uGen is the first retrieval-augmented multi-agent LLM framework for generating functionally correct microarchitectural attack PoCs, reporting up to 100% success on Spectre-v1 and 80% on Prime+Probe at low cost.

  2. The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code

    cs.CR 2026-05 unverdicted novelty 6.0

    A commercial LLM can cheaply produce large numbers of structurally diverse yet behaviorally equivalent malware payloads using functional prompts or history-augmented prompts.

  3. The Infinite Mutation Engine? Measuring Polymorphism in LLM-Generated Offensive Code

    cs.CR 2026-05 unverdicted novelty 6.0

    A single commercial LLM can cheaply generate large populations of behaviorally equivalent yet structurally diverse malware payloads.

  4. LLM Harms: A Taxonomy and Discussion

    cs.CY 2025-12 unverdicted novelty 3.0

    This paper proposes a taxonomy of LLM harms in five categories and suggests mitigation strategies plus a dynamic auditing system for responsible development.
