Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

Jaime Morales; Juan Tapiador; Sergio Pastrana

arxiv: 2605.06910 · v1 · submitted 2026-05-07 · 💻 cs.CR

Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

Jaime Morales , Sergio Pastrana , Juan Tapiador This is my paper

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.CR

keywords large language modelsIoC recoverycode obfuscationencryptionadversarial transformationsJavaScriptsecurity analysisbenchmark

0 comments

The pith

LLMs recover IoCs from obfuscated JavaScript except when encryption like XOR or AES-256 is used.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark to test whether large language models can extract indicators of compromise hidden inside JavaScript source code that has been deliberately obfuscated or encrypted. It assembles 336 programs that are transformed step by step through twelve levels, starting with simple changes such as variable renaming and Base64 encoding and advancing to cryptographic concealment with XOR and AES-256. Automated queries and standardized scoring let the authors compare many models on the same tasks. Results show strong recovery on lightweight transformations but a sharp drop once real encryption is introduced, marking encryption as the current limit for this kind of automated analysis.

Core claim

LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, yet encryption-based concealment with XOR and AES-256 severely degrades detection performance, establishing encryption as a critical frontier for LLM-driven code analysis.

What carries the argument

A dataset of 336 JavaScript programs transformed across twelve progressive levels of obfuscation and cryptographic concealment, together with an automated evaluation framework that standardizes LLM queries and response parsing.

If this is right

LLMs can serve as a practical first-pass tool for IoC extraction when only light obfuscation is present.
Automated threat-intelligence pipelines that rely on LLMs must add separate decryption or decryption-aware stages to maintain usefulness against encrypted samples.
Current model limitations are concentrated on cryptographic transformations rather than on syntax-level obfuscation.
Research into LLM code analysis should prioritize techniques that operate on or around encrypted content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Pairing LLMs with a lightweight decryption preprocessor before analysis could narrow the observed performance gap.
Attackers who add encryption layers gain a measurable edge against LLM-based IoC detectors.
Repeating the benchmark on other languages or on samples drawn from actual malware repositories would test how far the dichotomy generalizes.

Load-bearing premise

The constructed dataset and its twelve levels of transformation faithfully represent realistic adversarial concealment, and the automated scoring framework measures true LLM capability without introducing its own biases.

What would settle it

Running the same test suite on any current LLM and finding that IoC recovery rates on the AES-256 versions remain comparable to the Base64 versions would falsify the reported performance degradation.

Figures

Figures reproduced from arXiv: 2605.06910 by Jaime Morales, Juan Tapiador, Sergio Pastrana.

**Figure 1.** Figure 1: Workflow of the proposed system integrating obfuscation, AI-assisted deobfuscation, and automated evaluation. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: LLM detection outcomes across all obfuscation [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

read the original abstract

Software obfuscation and encryption present persistent challenges for program comprehension and security analysis, particularly when adversaries conceal Indicators of Compromise (IoCs) such as IP addresses within source code. While Large Language Models (LLMs) have recently demonstrated remarkable progress in code reasoning and transformation, their resilience against adversarial concealment techniques remains largely uncharted. This paper introduces a systematic benchmark for secret detection under adversarial code transformations, designed to evaluate the capacity of LLMs to recover IoCs embedded in obfuscated and encrypted JavaScript programs. We construct a dataset of 336 programs, progressively transformed through 12 levels of obfuscation and cryptographic concealment (including XOR and AES-256), to emulate realistic threat scenarios. An automated evaluation framework standardizes LLM queries and responses, enabling reproducible, large-scale testing across diverse models. Our results reveal a dichotomy: while LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, encryption-based concealment severely degrades detection performance. These findings establish encryption as a critical frontier for LLM-driven code analysis and highlight both current limitations and avenues for advancing automated threat intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper builds a graduated benchmark showing LLMs handle basic JS obfuscation for IoC recovery but drop sharply on encryption, though the automated scoring method needs tighter validation.

read the letter

The main thing here is a new benchmark that steps through 12 levels of obfuscation and encryption on 336 JavaScript programs to test how well LLMs pull out embedded IoCs like IP addresses. They start with light changes such as variable renaming and Base64, then move to XOR and AES-256, and report that performance holds up on the easy stuff but collapses once real encryption is involved. That tracks with what most people would guess about current models and code reasoning, but having a controlled, progressive testbed makes the drop-off measurable rather than anecdotal. The automated query-and-response framework is also a practical step for running these checks at scale without manual grading each time. What they did well is keep the transformations realistic enough to mimic adversary tactics while keeping the evaluation reproducible across models. The dataset construction itself looks like a solid piece of engineering that others could extend or reuse. The soft spots sit mostly in the measurement side. The framework has to turn open-ended LLM replies into binary success or failure, and if it leans on string matching or keyword checks instead of semantic judgment, it risks under-counting partial or explanatory answers on the encryption levels. That could inflate the apparent gap between lightweight and heavy concealment. The abstract also leaves out the exact success criteria, which models were run, error bars, and any controls for prompt variation, so the numbers are difficult to weigh without the methods section. This is the sort of work that would interest people building LLM-assisted security tools or threat-intel pipelines. A reader who needs a concrete test set for code concealment would get direct value from the levels and the dataset. I would send it for peer review because the benchmark is new and the practical question is clear, even if the scoring details and statistical reporting need tightening before the results can be taken as firm.

Referee Report

2 major / 2 minor

Summary. The paper introduces a systematic benchmark for evaluating LLMs' ability to recover Indicators of Compromise (IoCs) such as IP addresses from JavaScript programs subjected to adversarial obfuscation and encryption. It constructs a dataset of 336 programs transformed through 12 progressive levels (including variable renaming, Base64, XOR, and AES-256), employs an automated evaluation framework to standardize queries and parse responses across multiple models, and reports a performance dichotomy: high success on lightweight transformations but severe degradation under encryption-based concealment.

Significance. If the central empirical results hold under rigorous validation, the work provides a reproducible dataset and framework that quantifies LLM limitations in code comprehension for threat intelligence, establishing encryption as a key frontier. The systematic construction of 336 programs across 12 levels and the focus on automated, large-scale testing are strengths that could support follow-on research in AI-assisted security analysis.

major comments (2)

[Automated Evaluation Framework] The automated evaluation framework (abstract and methods) is load-bearing for the dichotomy claim, yet the manuscript provides no details on how free-form LLM responses are parsed into binary IoC recovery success (e.g., exact string match, regex for IP formats, or keyword presence versus semantic correctness). On encryption levels, responses often include explanations of inability to decrypt AES-256 without keys or partial attempts; simplistic parsing risks systematic undercounting here while overcounting on Base64 or renaming, exaggerating the degradation effect.
[Results] The results section reports high-level success rates but omits exact success metrics, statistical tests, error bars, specific models tested, or controls for prompt sensitivity and query variations. Without these, the cross-level comparison and the claim that encryption 'severely degrades' performance cannot be assessed for robustness.

minor comments (2)

[Abstract] The abstract states the dataset emulates 'realistic threat scenarios' but does not justify how the 12 levels map to actual adversary practices; a brief table or reference to threat reports would strengthen this.
The paper should explicitly state whether the dataset, transformation scripts, and evaluation code are publicly released to enable independent verification of the automated framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our benchmark paper. The comments highlight important areas for improving reproducibility and rigor, and we have revised the manuscript to address them directly.

read point-by-point responses

Referee: [Automated Evaluation Framework] The automated evaluation framework (abstract and methods) is load-bearing for the dichotomy claim, yet the manuscript provides no details on how free-form LLM responses are parsed into binary IoC recovery success (e.g., exact string match, regex for IP formats, or keyword presence versus semantic correctness). On encryption levels, responses often include explanations of inability to decrypt AES-256 without keys or partial attempts; simplistic parsing risks systematic undercounting here while overcounting on Base64 or renaming, exaggerating the degradation effect.

Authors: We agree that the parsing logic was insufficiently detailed and could introduce bias. In the revised manuscript we have expanded the Methods section (now subsection 3.3) with a full description of the automated parser: it applies regex patterns for standard IoC formats (IPv4, domain names, URLs) followed by exact string matching against the ground-truth IoCs; any response that contains only explanatory text about decryption failure or inability to recover the value is scored as unsuccessful. We have added pseudocode and representative examples from each concealment level in a new appendix to show that the observed performance drop under encryption is not an artifact of overly lenient or strict parsing. revision: yes
Referee: [Results] The results section reports high-level success rates but omits exact success metrics, statistical tests, error bars, specific models tested, or controls for prompt sensitivity and query variations. Without these, the cross-level comparison and the claim that encryption 'severely degrades' performance cannot be assessed for robustness.

Authors: We accept that the original results presentation was too high-level. The revised manuscript now contains a new Table 2 reporting exact per-model, per-level success percentages together with standard deviations computed over five independent runs that vary temperature (0.0, 0.5, 1.0) and minor prompt phrasing. The models evaluated are explicitly listed as GPT-4o, Claude-3-Opus, and Gemini-1.5-Pro. We added Wilcoxon signed-rank tests comparing lightweight versus encryption-based levels (all p < 0.01) and a short sensitivity analysis confirming that the relative collapse under encryption persists across prompt variations. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with direct measurements

full rationale

The paper constructs a dataset of 336 programs across 12 obfuscation levels and evaluates LLMs via an automated framework that standardizes queries and parses responses for IoC recovery success. No derivations, equations, fitted parameters, or predictions appear; results are obtained by direct execution on the dataset. No load-bearing self-citations or uniqueness theorems are invoked. The central dichotomy claim rests on observed success rates rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical studies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on empirical evaluation of LLMs against a constructed dataset. No free parameters are fitted, no new entities are postulated, and no additional axioms beyond standard assumptions in machine-learning benchmarking and code obfuscation are invoked.

pith-pipeline@v0.9.0 · 5496 in / 1286 out tokens · 38070 ms · 2026-05-11T00:57:20.277945+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results reveal a dichotomy: while LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, encryption-based concealment severely degrades detection performance.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We construct a dataset of 336 programs, progressively transformed through 12 levels of obfuscation and cryptographic concealment (including XOR and AES-256)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

[n. d.]. crypto-js: JavaScript library of crypto standards. https://www.npmjs. com/package/crypto-js. Accessed: 2026-02-18

work page 2026
[2]

[n. d.]. javascript-obfuscator: JavaScript obfuscator for Node.js and browser. https://www.npmjs.com/package/javascript-obfuscator. Accessed: 2026-02-18

work page 2026
[3]

Jamal Al-Karaki, Muhammad Al-Zafar Khan, and Marwan Omar. 2024. Exploring LLMs for Malware Detection: Review, Framework Design, and Countermeasure Approaches. arXiv:2409.07587 [cs.CR] https://arxiv.org/abs/2409.07587

work page arXiv 2024
[4]

Aswini and M

S. Aswini and M. Yazhini. 2017. An Assessment Framework of Routing Com- plexities Using LOC Metrics. InProceedings of the 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, 1–6. https://doi.org/10.1109/ IPACT.2017.8245022

work page arXiv 2017
[5]

Author(s). [n. d.]. Title of IEEE paper (to be completed). https://ieeexplore.ieee. org/document/XXXXXXX. Replace with final metadata

work page
[6]

Juan Caballero, Gibran Gomez, Srdjan Matic, Gustavo Sánchez, Silvia Sebastián, and Arturo Villacañas. 2023. The Rise of GOODFATR: A Novel Accuracy Com- parison Methodology for Indicator Extraction Tools.Future Generation Computer Systems144 (2023), 74–89

work page 2023
[7]

Guoqiang Chen, Xin Jin, and Zhiqiang Lin. 2025. JsDeObsBench: Measur- ing and Benchmarking LLMs for JavaScript Deobfuscation.arXiv preprint arXiv:2506.20170(2025)

work page arXiv 2025
[8]

Byunggeon Choi, Hongjoo Jin, Dong Hoon Lee, and Wonsuk Choi. 2026. Toward Efficient Deobfuscation via Large Language Models.IEEE Access(2026), 1–1. https://doi.org/10.1109/ACCESS.2026.3660477

work page doi:10.1109/access.2026.3660477 2026
[9]

Shan Jiang, Pranoy Kovuri, David Tao, and Zhixun Tan. 2025. CASCADE: LLM- Powered JavaScript Deobfuscator at Google. arXiv:2507.17691 [cs.SE] https: //arxiv.org/abs/2507.17691

work page arXiv 2025
[10]

Zhihong Liang, Xin Wang, Zhenhuang Hu, Liangliang Song, Lin Chen, Jingjing Guo, Yanbin Wang, and Ye Tian. 2025. Breaking Obfuscation: Cluster- Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection. arXiv:2507.22447 [cs.CR] https://arxiv.org/abs/2507.22447

work page arXiv 2025
[11]

Xiaojing Liao, Kan Yuan, XiaoFeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah. 2016. Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence. InProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security(Vienna, Austria)(CCS ’16). Association for Computing Machinery, New York, NY, USA, ...

work page doi:10.1145/2976749.2978315 2016
[12]

Mariella Mischinger, Sergio Pastrana, and Guillermo Suarez-Tangil. 2024. IOC Stalker: Early Detection of Indicators of Compromise. In2024 Annual Computer Security Applications Conference (ACSAC). IEEE, i–xvii

work page 2024
[13]

Jaime Morales. 2026. llm-ioc-detection. https://github.com/jaimemorales52/llm- ioc-detection

work page 2026
[14]

Constantinos Patsakis, Fran Casino, and Nikolaos Lykousas. 2024. Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns. Expert Systems with Applications256 (2024), 124912

work page 2024
[15]

Constantinos Patsakis, Fran Casino, and Nikolaos Lykousas. 2024. Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns.arXiv preprint2404.19715 (2024). https://arxiv.org/abs/2404.19715

work page arXiv 2024
[16]

Priya Ramesh, Alex Smith, and Yu Chen. 2025. LLMs as Cyber Defenders: Evalu- ating Autonomous Threat Intelligence Extraction. InNetwork and Distributed System Security Symposium (NDSS)

work page 2025
[17]

Sampsa Rauti and Samuli Laato. 2024. Enhancing Resilience in IoT Cybersecurity: The Roles of Obfuscation and Diversification Techniques for Improving the Multilayered Cybersecurity of IoT Systems.Data & Policy6 (2024), e74. https: //doi.org/10.1017/dap.2024.84

work page doi:10.1017/dap.2024.84 2024
[18]

Minghao Shao, Boyuan Chen, Sofija Jancheska, Brendan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, and Muhammad Shafique. 2024. An Empirical Evaluation of LLMs for Solving Offensive Security Challenges.arXiv preprint2402.11814 (2024). https://arxiv.org/abs/2402.11814

work page arXiv 2024
[19]

The Algorithms Community. 2024. The Algorithms—JavaScript Repository. https://github.com/TheAlgorithms/JavaScript. Accessed 2025-05-13. Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

work page 2024
[20]

Hui Xu, Yangfan Zhou, Jiang Ming, and Michael Lyu. 2020. Layered Obfusca- tion: A Taxonomy of Software Obfuscation Techniques for Layered Security. Cybersecurity3 (2020), 9. https://doi.org/10.1186/s42400-020-00049-3

work page doi:10.1186/s42400-020-00049-3 2020

[1] [1]

[n. d.]. crypto-js: JavaScript library of crypto standards. https://www.npmjs. com/package/crypto-js. Accessed: 2026-02-18

work page 2026

[2] [2]

[n. d.]. javascript-obfuscator: JavaScript obfuscator for Node.js and browser. https://www.npmjs.com/package/javascript-obfuscator. Accessed: 2026-02-18

work page 2026

[3] [3]

Jamal Al-Karaki, Muhammad Al-Zafar Khan, and Marwan Omar. 2024. Exploring LLMs for Malware Detection: Review, Framework Design, and Countermeasure Approaches. arXiv:2409.07587 [cs.CR] https://arxiv.org/abs/2409.07587

work page arXiv 2024

[4] [4]

Aswini and M

S. Aswini and M. Yazhini. 2017. An Assessment Framework of Routing Com- plexities Using LOC Metrics. InProceedings of the 2017 Innovations in Power and Advanced Computing Technologies (i-PACT). IEEE, 1–6. https://doi.org/10.1109/ IPACT.2017.8245022

work page arXiv 2017

[5] [5]

Author(s). [n. d.]. Title of IEEE paper (to be completed). https://ieeexplore.ieee. org/document/XXXXXXX. Replace with final metadata

work page

[6] [6]

Juan Caballero, Gibran Gomez, Srdjan Matic, Gustavo Sánchez, Silvia Sebastián, and Arturo Villacañas. 2023. The Rise of GOODFATR: A Novel Accuracy Com- parison Methodology for Indicator Extraction Tools.Future Generation Computer Systems144 (2023), 74–89

work page 2023

[7] [7]

Guoqiang Chen, Xin Jin, and Zhiqiang Lin. 2025. JsDeObsBench: Measur- ing and Benchmarking LLMs for JavaScript Deobfuscation.arXiv preprint arXiv:2506.20170(2025)

work page arXiv 2025

[8] [8]

Byunggeon Choi, Hongjoo Jin, Dong Hoon Lee, and Wonsuk Choi. 2026. Toward Efficient Deobfuscation via Large Language Models.IEEE Access(2026), 1–1. https://doi.org/10.1109/ACCESS.2026.3660477

work page doi:10.1109/access.2026.3660477 2026

[9] [9]

Shan Jiang, Pranoy Kovuri, David Tao, and Zhixun Tan. 2025. CASCADE: LLM- Powered JavaScript Deobfuscator at Google. arXiv:2507.17691 [cs.SE] https: //arxiv.org/abs/2507.17691

work page arXiv 2025

[10] [10]

Zhihong Liang, Xin Wang, Zhenhuang Hu, Liangliang Song, Lin Chen, Jingjing Guo, Yanbin Wang, and Ye Tian. 2025. Breaking Obfuscation: Cluster- Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection. arXiv:2507.22447 [cs.CR] https://arxiv.org/abs/2507.22447

work page arXiv 2025

[11] [11]

Xiaojing Liao, Kan Yuan, XiaoFeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah. 2016. Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence. InProceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security(Vienna, Austria)(CCS ’16). Association for Computing Machinery, New York, NY, USA, ...

work page doi:10.1145/2976749.2978315 2016

[12] [12]

Mariella Mischinger, Sergio Pastrana, and Guillermo Suarez-Tangil. 2024. IOC Stalker: Early Detection of Indicators of Compromise. In2024 Annual Computer Security Applications Conference (ACSAC). IEEE, i–xvii

work page 2024

[13] [13]

Jaime Morales. 2026. llm-ioc-detection. https://github.com/jaimemorales52/llm- ioc-detection

work page 2026

[14] [14]

Constantinos Patsakis, Fran Casino, and Nikolaos Lykousas. 2024. Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns. Expert Systems with Applications256 (2024), 124912

work page 2024

[15] [15]

Constantinos Patsakis, Fran Casino, and Nikolaos Lykousas. 2024. Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns.arXiv preprint2404.19715 (2024). https://arxiv.org/abs/2404.19715

work page arXiv 2024

[16] [16]

Priya Ramesh, Alex Smith, and Yu Chen. 2025. LLMs as Cyber Defenders: Evalu- ating Autonomous Threat Intelligence Extraction. InNetwork and Distributed System Security Symposium (NDSS)

work page 2025

[17] [17]

Sampsa Rauti and Samuli Laato. 2024. Enhancing Resilience in IoT Cybersecurity: The Roles of Obfuscation and Diversification Techniques for Improving the Multilayered Cybersecurity of IoT Systems.Data & Policy6 (2024), e74. https: //doi.org/10.1017/dap.2024.84

work page doi:10.1017/dap.2024.84 2024

[18] [18]

Minghao Shao, Boyuan Chen, Sofija Jancheska, Brendan Dolan-Gavitt, Siddharth Garg, Ramesh Karri, and Muhammad Shafique. 2024. An Empirical Evaluation of LLMs for Solving Offensive Security Challenges.arXiv preprint2402.11814 (2024). https://arxiv.org/abs/2402.11814

work page arXiv 2024

[19] [19]

The Algorithms Community. 2024. The Algorithms—JavaScript Repository. https://github.com/TheAlgorithms/JavaScript. Accessed 2025-05-13. Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption

work page 2024

[20] [20]

Hui Xu, Yangfan Zhou, Jiang Ming, and Michael Lyu. 2020. Layered Obfusca- tion: A Taxonomy of Software Obfuscation Techniques for Layered Security. Cybersecurity3 (2020), 9. https://doi.org/10.1186/s42400-020-00049-3

work page doi:10.1186/s42400-020-00049-3 2020