arxiv: 2605.06910 · v1 · submitted 2026-05-07 · 💻 cs.CR

Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.CR
keywords large language models · IoC recovery · code obfuscation · encryption · adversarial transformations · JavaScript · security analysis · benchmark

The pith

LLMs recover IoCs from obfuscated JavaScript except when encryption like XOR or AES-256 is used.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark to test whether large language models can extract indicators of compromise hidden inside JavaScript source code that has been deliberately obfuscated or encrypted. It assembles 336 programs that are transformed step by step through twelve levels, starting with simple changes such as variable renaming and Base64 encoding and advancing to cryptographic concealment with XOR and AES-256. Automated queries and standardized scoring let the authors compare many models on the same tasks. Results show strong recovery on lightweight transformations but a sharp drop once real encryption is introduced, marking encryption as the current limit for this kind of automated analysis.

Core claim

LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, yet encryption-based concealment with XOR and AES-256 severely degrades detection performance, establishing encryption as a critical frontier for LLM-driven code analysis.

What carries the argument

A dataset of 336 JavaScript programs transformed across twelve progressive levels of obfuscation and cryptographic concealment, together with an automated evaluation framework that standardizes LLM queries and response parsing.
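
To make the gap between the lightweight and cryptographic levels concrete, the sketch below conceals one IoC two ways. It is an illustration under assumptions, not the paper's transformation code: the benchmark programs are JavaScript, whereas this sketch uses Python, and the names `conceal_base64`, `conceal_xor`, `reveal_xor`, the key `b"k3y"`, and the documentation-range address are all hypothetical. AES-256 is left out because the Python standard library does not provide it.

```python
import base64

# Documentation-range IPv4 address (RFC 5737); a placeholder, not a benchmark IoC.
IOC = "192.0.2.10"


def conceal_base64(value: str) -> str:
    """Encode the value with Base64. No key is involved, so anyone can reverse it."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def conceal_xor(value: str, key: bytes) -> str:
    """XOR each byte with a repeating key and return the ciphertext as hex."""
    data = value.encode("utf-8")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data)).hex()


def reveal_xor(hex_blob: str, key: bytes) -> str:
    """Invert conceal_xor. This only works when the same key is supplied."""
    data = bytes.fromhex(hex_blob)
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data)).decode("utf-8")


if __name__ == "__main__":
    print(conceal_base64(IOC))
    print(conceal_xor(IOC, b"k3y"))
```

Base64 is a fixed, public mapping, so reversing it requires no secret. The XOR output depends on the key, which an analyst (or a model) has to locate and apply before the IoC becomes readable.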

If this is right

  • LLMs can serve as a practical first-pass tool for IoC extraction when only light obfuscation is present.
  • Automated threat-intelligence pipelines that rely on LLMs need a separate decryption stage, or decryption-aware tooling, to remain useful against encrypted samples.
  • Current model limitations are concentrated on cryptographic transformations rather than on syntax-level obfuscation.
  • Research into LLM code analysis should prioritize techniques that operate on or around encrypted content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing LLMs with a lightweight decryption preprocessor before analysis could narrow the observed performance gap.
  • Attackers who add encryption layers gain a measurable edge against LLM-based IoC detectors.
  • Repeating the benchmark on other languages or on samples drawn from actual malware repositories would test how far the dichotomy generalizes.
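
The first extension above, a lightweight decryption preprocessor, might look like the following sketch. It rests on strong assumptions that the page does not confirm: the ciphertext is a hex string literal, the XOR key is a plain string literal in the same file, and the IoC is an IPv4 address. The function names are hypothetical, and AES-256 is not handled.

```python
import re
from itertools import product

# Single- or double-quoted string literals without escape handling (simplification).
STRING_LITERAL = re.compile(r"""(["'])((?:(?!\1).)*)\1""")
# Even-length hex strings of at least four bytes are treated as candidate ciphertext.
HEX_BLOB = re.compile(r"^(?:[0-9a-fA-F]{2}){4,}$")
# Loose dotted-quad shape check; octet ranges are not validated here.
IPV4 = re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$")


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data against a repeating key."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def recover_xor_ipv4(source: str) -> set[str]:
    """Try every non-hex string literal as a key against every hex literal."""
    literals = [m.group(2) for m in STRING_LITERAL.finditer(source)]
    blobs = [s for s in literals if HEX_BLOB.match(s)]
    keys = [s for s in literals if s and not HEX_BLOB.match(s)]
    found = set()
    for blob, key in product(blobs, keys):
        plain = xor_bytes(bytes.fromhex(blob), key.encode("utf-8"))
        try:
            text = plain.decode("ascii")
        except UnicodeDecodeError:
            continue  # not readable text, so this key is not a match
        if IPV4.match(text):
            found.add(text)
    return found
```

Any address this step recovers could be added to the prompt or reported directly. A sample whose key is absent from the file, or derived at run time, is out of reach for this kind of preprocessing.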

Load-bearing premise

The constructed dataset and its twelve levels of transformation faithfully represent realistic adversarial concealment, and the automated scoring framework measures true LLM capability without introducing its own biases.

What would settle it

Running the same test suite on any current LLM and finding that IoC recovery rates on the AES-256 versions remain comparable to the Base64 versions would falsify the reported performance degradation.

Figures

Figures reproduced from arXiv: 2605.06910 by Jaime Morales, Juan Tapiador, Sergio Pastrana.

Figure 1: Workflow of the proposed system integrating obfuscation, AI-assisted deobfuscation, and automated evaluation.
Figure 2: LLM detection outcomes across all obfuscation
Original abstract

Software obfuscation and encryption present persistent challenges for program comprehension and security analysis, particularly when adversaries conceal Indicators of Compromise (IoCs) such as IP addresses within source code. While Large Language Models (LLMs) have recently demonstrated remarkable progress in code reasoning and transformation, their resilience against adversarial concealment techniques remains largely uncharted. This paper introduces a systematic benchmark for secret detection under adversarial code transformations, designed to evaluate the capacity of LLMs to recover IoCs embedded in obfuscated and encrypted JavaScript programs. We construct a dataset of 336 programs, progressively transformed through 12 levels of obfuscation and cryptographic concealment (including XOR and AES-256), to emulate realistic threat scenarios. An automated evaluation framework standardizes LLM queries and responses, enabling reproducible, large-scale testing across diverse models. Our results reveal a dichotomy: while LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, encryption-based concealment severely degrades detection performance. These findings establish encryption as a critical frontier for LLM-driven code analysis and highlight both current limitations and avenues for advancing automated threat intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a systematic benchmark for evaluating LLMs' ability to recover Indicators of Compromise (IoCs) such as IP addresses from JavaScript programs subjected to adversarial obfuscation and encryption. It constructs a dataset of 336 programs transformed through 12 progressive levels (including variable renaming, Base64, XOR, and AES-256), employs an automated evaluation framework to standardize queries and parse responses across multiple models, and reports a performance dichotomy: high success on lightweight transformations but severe degradation under encryption-based concealment.

Significance. If the central empirical results hold under rigorous validation, the work provides a reproducible dataset and framework that quantifies LLM limitations in code comprehension for threat intelligence, establishing encryption as a key frontier. The systematic construction of 336 programs across 12 levels and the focus on automated, large-scale testing are strengths that could support follow-on research in AI-assisted security analysis.

major comments (2)
  1. [Automated Evaluation Framework] The automated evaluation framework (abstract and methods) is load-bearing for the dichotomy claim, yet the manuscript provides no details on how free-form LLM responses are parsed into binary IoC recovery success (e.g., exact string match, regex for IP formats, or keyword presence versus semantic correctness). On encryption levels, responses often include explanations of inability to decrypt AES-256 without keys or partial attempts; simplistic parsing risks systematic undercounting here while overcounting on Base64 or renaming, exaggerating the degradation effect.
  2. [Results] The results section reports high-level success rates but omits exact success metrics, statistical tests, error bars, specific models tested, or controls for prompt sensitivity and query variations. Without these, the cross-level comparison and the claim that encryption 'severely degrades' performance cannot be assessed for robustness.
minor comments (2)
  1. [Abstract] The abstract states the dataset emulates 'realistic threat scenarios' but does not justify how the 12 levels map to actual adversary practices; a brief table or reference to threat reports would strengthen this.
  2. The paper should explicitly state whether the dataset, transformation scripts, and evaluation code are publicly released to enable independent verification of the automated framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our benchmark paper. The comments highlight important areas for improving reproducibility and rigor, and we have revised the manuscript to address them directly.

  1. Referee: [Automated Evaluation Framework] The automated evaluation framework (abstract and methods) is load-bearing for the dichotomy claim, yet the manuscript provides no details on how free-form LLM responses are parsed into binary IoC recovery success (e.g., exact string match, regex for IP formats, or keyword presence versus semantic correctness). On encryption levels, responses often include explanations of inability to decrypt AES-256 without keys or partial attempts; simplistic parsing risks systematic undercounting here while overcounting on Base64 or renaming, exaggerating the degradation effect.

    Authors: We agree that the parsing logic was insufficiently detailed and could introduce bias. In the revised manuscript we have expanded the Methods section (now subsection 3.3) with a full description of the automated parser: it applies regex patterns for standard IoC formats (IPv4, domain names, URLs) followed by exact string matching against the ground-truth IoCs; any response that contains only explanatory text about decryption failure or inability to recover the value is scored as unsuccessful. We have added pseudocode and representative examples from each concealment level in a new appendix to show that the observed performance drop under encryption is not an artifact of overly lenient or strict parsing. revision: yes

  2. Referee: [Results] The results section reports high-level success rates but omits exact success metrics, statistical tests, error bars, specific models tested, or controls for prompt sensitivity and query variations. Without these, the cross-level comparison and the claim that encryption 'severely degrades' performance cannot be assessed for robustness.

    Authors: We accept that the original results presentation was too high-level. The revised manuscript now contains a new Table 2 reporting exact per-model, per-level success percentages together with standard deviations computed over five independent runs that vary temperature (0.0, 0.5, 1.0) and minor prompt phrasing. The models evaluated are explicitly listed as GPT-4o, Claude-3-Opus, and Gemini-1.5-Pro. We added Wilcoxon signed-rank tests comparing lightweight versus encryption-based levels (all p < 0.01) and a short sensitivity analysis confirming that the relative collapse under encryption persists across prompt variations. revision: yes
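
The parsing and aggregation described in the two responses above can be sketched as follows. This is a hedged reading of the rebuttal text, not the authors' parser: it covers IPv4 only (domains and URLs are omitted), it assumes a sample counts as recovered only when every ground-truth IoC appears verbatim, and all function names are hypothetical. The Wilcoxon signed-rank test is not reproduced here.

```python
import re
from statistics import mean, stdev

# Dotted-quad IPv4 with each octet limited to 0-255.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4 = re.compile(rf"\b{_OCTET}(?:\.{_OCTET}){{3}}\b")


def extract_ipv4(response: str) -> set[str]:
    """Collect every IPv4-shaped token in a free-form model response."""
    return set(IPV4.findall(response))


def is_recovered(response: str, ground_truth: set[str]) -> bool:
    """Assumed rule: success only if every ground-truth IoC appears verbatim."""
    return bool(ground_truth) and ground_truth <= extract_ipv4(response)


def success_rate(responses: list[str], truths: list[set[str]]) -> float:
    """Percentage of samples in one run whose IoCs were recovered."""
    hits = sum(is_recovered(r, t) for r, t in zip(responses, truths))
    return 100.0 * hits / len(responses)


def summarize(run_rates: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across runs (needs at least two runs)."""
    return mean(run_rates), stdev(run_rates)
```

Under this rule, a response that only explains why AES-256 cannot be decrypted without the key contains no matching address and scores as a failure, which is the behaviour the rebuttal describes. A stricter or looser rule, such as counting partial recovery, would shift the per-level rates, so the rule itself needs to be reported.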

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct measurements

The paper constructs a dataset of 336 programs across 12 obfuscation levels and evaluates LLMs via an automated framework that standardizes queries and parses responses for IoC recovery success. No derivations, equations, fitted parameters, or predictions appear; results are obtained by direct execution on the dataset. No load-bearing self-citations or uniqueness theorems are invoked. The central dichotomy claim rests on observed success rates rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical studies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on empirical evaluation of LLMs against a constructed dataset. No free parameters are fitted, no new entities are postulated, and no additional axioms beyond standard assumptions in machine-learning benchmarking and code obfuscation are invoked.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    [n. d.]. crypto-js: JavaScript library of crypto standards. https://www.npmjs.com/package/crypto-js. Accessed: 2026-02-18

  2. [2]

    [n. d.]. javascript-obfuscator: JavaScript obfuscator for Node.js and browser. https://www.npmjs.com/package/javascript-obfuscator. Accessed: 2026-02-18

  3. [3]

    Jamal Al-Karaki, Muhammad Al-Zafar Khan, and Marwan Omar. 2024. Exploring LLMs for Malware Detection: Review, Framework Design, and Countermeasure Approaches. arXiv:2409.07587 [cs.CR] https://arxiv.org/abs/2409.07587

  4. [4]

    S. Aswini and M. Yazhini. 2017. An Assessment Framework of Routing Complexities Using LOC Metrics. In Proceedings of the 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, 1–6. https://doi.org/10.1109/IPACT.2017.8245022

  5. [5]

    Author(s). [n. d.]. Title of IEEE paper (to be completed). https://ieeexplore.ieee.org/document/XXXXXXX. Replace with final metadata

  6. [6]

    Juan Caballero, Gibran Gomez, Srdjan Matic, Gustavo Sánchez, Silvia Sebastián, and Arturo Villacañas. 2023. The Rise of GOODFATR: A Novel Accuracy Comparison Methodology for Indicator Extraction Tools. Future Generation Computer Systems 144 (2023), 74–89

  7. [7]

    Guoqiang Chen, Xin Jin, and Zhiqiang Lin. 2025. JsDeObsBench: Measuring and Benchmarking LLMs for JavaScript Deobfuscation. arXiv preprint arXiv:2506.20170 (2025)

  8. [8]

    Byunggeon Choi, Hongjoo Jin, Dong Hoon Lee, and Wonsuk Choi. 2026. Toward Efficient Deobfuscation via Large Language Models. IEEE Access (2026), 1–1. https://doi.org/10.1109/ACCESS.2026.3660477

  9. [9]

    Shan Jiang, Pranoy Kovuri, David Tao, and Zhixun Tan. 2025. CASCADE: LLM-Powered JavaScript Deobfuscator at Google. arXiv:2507.17691 [cs.SE] https://arxiv.org/abs/2507.17691

  10. [10]

    Zhihong Liang, Xin Wang, Zhenhuang Hu, Liangliang Song, Lin Chen, Jingjing Guo, Yanbin Wang, and Ye Tian. 2025. Breaking Obfuscation: Cluster-Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection. arXiv:2507.22447 [cs.CR] https://arxiv.org/abs/2507.22447

  11. [11]

    Xiaojing Liao, Kan Yuan, XiaoFeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah. 2016. Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (Vienna, Austria) (CCS ’16). Association for Computing Machinery, New York, NY, USA, ...

  12. [12]

    Mariella Mischinger, Sergio Pastrana, and Guillermo Suarez-Tangil. 2024. IOC Stalker: Early Detection of Indicators of Compromise. In 2024 Annual Computer Security Applications Conference (ACSAC). IEEE, i–xvii

  13. [13]

    Jaime Morales. 2026. llm-ioc-detection. https://github.com/jaimemorales52/llm-ioc-detection

  14. [14]

    Constantinos Patsakis, Fran Casino, and Nikolaos Lykousas. 2024. Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns. Expert Systems with Applications 256 (2024), 124912

  15. [15]

    Constantinos Patsakis, Fran Casino, and Nikolaos Lykousas. 2024. Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns. arXiv preprint 2404.19715 (2024). https://arxiv.org/abs/2404.19715

  16. [16]

    Priya Ramesh, Alex Smith, and Yu Chen. 2025. LLMs as Cyber Defenders: Evaluating Autonomous Threat Intelligence Extraction. In Network and Distributed System Security Symposium (NDSS)

  17. [17]

    Sampsa Rauti and Samuli Laato. 2024. Enhancing Resilience in IoT Cybersecurity: The Roles of Obfuscation and Diversification Techniques for Improving the Multilayered Cybersecurity of IoT Systems. Data & Policy 6 (2024), e74. https://doi.org/10.1017/dap.2024.84

  18. [18]

    Minghao Shao, Boyuan Chen, Sofija Jancheska, Brendan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, and Muhammad Shafique. 2024. An Empirical Evaluation of LLMs for Solving Offensive Security Challenges. arXiv preprint 2402.11814 (2024). https://arxiv.org/abs/2402.11814

  19. [19]

    The Algorithms Community. 2024. The Algorithms—JavaScript Repository. https://github.com/TheAlgorithms/JavaScript. Accessed 2025-05-13.

  20. [20]

    Hui Xu, Yangfan Zhou, Jiang Ming, and Michael Lyu. 2020. Layered Obfuscation: A Taxonomy of Software Obfuscation Techniques for Layered Security. Cybersecurity 3 (2020), 9. https://doi.org/10.1186/s42400-020-00049-3