Benchmarking Large Language Models for IoC Recovery under Adversarial Code Obfuscation and Encryption
Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3
The pith
LLMs recover IoCs from obfuscated JavaScript except when encryption like XOR or AES-256 is used.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, yet encryption-based concealment with XOR and AES-256 severely degrades detection performance, establishing encryption as a critical frontier for LLM-driven code analysis.
What carries the argument
A dataset of 336 JavaScript programs transformed across twelve progressive levels of obfuscation and cryptographic concealment, together with an automated evaluation framework that standardizes LLM queries and response parsing.
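As a rough illustration of what progressive concealment of a single IoC can look like, the Python sketch below emits three JavaScript variants of the same one-line program. This page names only variable renaming, Base64, XOR, and AES-256 among the twelve levels, so the function names, the single-byte XOR key, and the choice of `atob` and `String.fromCharCode` in the generated JavaScript are all assumptions made for illustration, not the authors' transformation scripts. The address 192.0.2.1 is a reserved documentation address.

```python
import base64


def plain_variant(ip: str) -> str:
    """Level with no concealment: the IoC appears as a string literal."""
    return f'const host = "{ip}";'


def base64_variant(ip: str) -> str:
    """Encoding only: reversible by anyone, no key involved."""
    encoded = base64.b64encode(ip.encode("ascii")).decode("ascii")
    return f'const host = atob("{encoded}");'


def xor_variant(ip: str, key: int) -> str:
    """Single-byte XOR (hypothetical scheme): each byte is masked with `key`.

    The key is embedded in the emitted JavaScript so the program still runs.
    """
    masked = ", ".join(str(byte ^ key) for byte in ip.encode("ascii"))
    return f"const host = String.fromCharCode(...[{masked}].map(b => b ^ {key}));"


def build_variants(ip: str, key: int = 0x5A) -> dict[str, str]:
    """Return one JavaScript source string per illustrative concealment level."""
    return {
        "plain": plain_variant(ip),
        "base64": base64_variant(ip),
        "xor": xor_variant(ip, key),
    }


if __name__ == "__main__":
    for level, source in build_variants("192.0.2.1").items():
        print(f"{level:>7}: {source}")
```

In the plain variant the address is visible to a simple text search; in the other two it never appears verbatim, so a model has to carry out the decoding step to recover it.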
If this is right
- LLMs can serve as a practical first-pass tool for IoC extraction when only light obfuscation is present.
- Automated threat-intelligence pipelines that rely on LLMs need a separate decryption stage, or decryption-aware analysis, to remain useful against encrypted samples.
- Current model limitations are concentrated on cryptographic transformations rather than on syntax-level obfuscation.
- Research into LLM code analysis should prioritize techniques that operate on or around encrypted content.
Where Pith is reading between the lines
- Pairing LLMs with a lightweight decryption preprocessor before analysis could narrow the observed performance gap.
- Attackers who add encryption layers gain a measurable edge against LLM-based IoC detectors.
- Repeating the benchmark on other languages or on samples drawn from actual malware repositories would test how far the dichotomy generalizes.
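The preprocessor idea in the first bullet above can be sketched in a few lines of Python. This is a minimal sketch under strong assumptions: it only handles Base64 string literals and numeric arrays masked with a single-byte XOR key, and every name in it (`surface_hidden_ipv4`, `decode_base64_literals`, `brute_force_xor_arrays`) is hypothetical. It does not attempt AES-256, which cannot be brute-forced this way and would require recovering the key from the program itself.

```python
import base64
import binascii
import re

OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Dotted-quad IPv4 that is not part of a longer dotted number.
IPV4 = re.compile(rf"(?<!\d)(?<!\d\.){OCTET}(?:\.{OCTET}){{3}}(?!\d)(?!\.\d)")
# Quoted string that looks like Base64 (heuristic, assumed minimum length 8).
B64_LITERAL = re.compile(r"""["']([A-Za-z0-9+/]{8,}={0,2})["']""")
# Bracketed list of small integers, e.g. [107, 99, 104].
BYTE_ARRAY = re.compile(r"\[(\s*\d{1,3}(?:\s*,\s*\d{1,3})+\s*)\]")


def decode_base64_literals(source: str) -> list[str]:
    """Decode Base64-looking literals and return any IPv4 addresses inside."""
    found = []
    for match in B64_LITERAL.finditer(source):
        try:
            text = base64.b64decode(match.group(1), validate=True).decode("ascii")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 or not ASCII text
        found.extend(IPV4.findall(text))
    return found


def brute_force_xor_arrays(source: str) -> list[str]:
    """Try all 256 single-byte XOR keys on each numeric array literal."""
    found = []
    for match in BYTE_ARRAY.finditer(source):
        values = [int(v) for v in match.group(1).split(",")]
        if any(v > 255 for v in values):
            continue  # not a byte array
        for key in range(256):
            try:
                text = bytes(v ^ key for v in values).decode("ascii")
            except UnicodeDecodeError:
                continue
            if IPV4.fullmatch(text):
                found.append(text)
    return found


def surface_hidden_ipv4(source: str) -> list[str]:
    """Collect IPv4 addresses visible directly or after cheap decoding."""
    hits = set(IPV4.findall(source))
    hits.update(decode_base64_literals(source))
    hits.update(brute_force_xor_arrays(source))
    return sorted(hits)
```

The returned candidates could be prepended to the LLM prompt as hints. Whether that narrows the gap reported in the paper is an open question that this sketch does not answer.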
Load-bearing premise
The constructed dataset and its twelve levels of transformation faithfully represent realistic adversarial concealment, and the automated scoring framework measures true LLM capability without introducing its own biases.
What would settle it
Running the same test suite on a current LLM and finding that IoC recovery rates on the AES-256 variants are comparable to those on the Base64 variants would falsify the reported performance degradation.
Original abstract
Software obfuscation and encryption present persistent challenges for program comprehension and security analysis, particularly when adversaries conceal Indicators of Compromise (IoCs) such as IP addresses within source code. While Large Language Models (LLMs) have recently demonstrated remarkable progress in code reasoning and transformation, their resilience against adversarial concealment techniques remains largely uncharted. This paper introduces a systematic benchmark for secret detection under adversarial code transformations, designed to evaluate the capacity of LLMs to recover IoCs embedded in obfuscated and encrypted JavaScript programs. We construct a dataset of 336 programs, progressively transformed through 12 levels of obfuscation and cryptographic concealment (including XOR and AES-256), to emulate realistic threat scenarios. An automated evaluation framework standardizes LLM queries and responses, enabling reproducible, large-scale testing across diverse models. Our results reveal a dichotomy: while LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, encryption-based concealment severely degrades detection performance. These findings establish encryption as a critical frontier for LLM-driven code analysis and highlight both current limitations and avenues for advancing automated threat intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a systematic benchmark for evaluating LLMs' ability to recover Indicators of Compromise (IoCs) such as IP addresses from JavaScript programs subjected to adversarial obfuscation and encryption. It constructs a dataset of 336 programs transformed through 12 progressive levels (including variable renaming, Base64, XOR, and AES-256), employs an automated evaluation framework to standardize queries and parse responses across multiple models, and reports a performance dichotomy: high success on lightweight transformations but severe degradation under encryption-based concealment.
Significance. If the central empirical results hold under rigorous validation, the work provides a reproducible dataset and framework that quantifies LLM limitations in code comprehension for threat intelligence, establishing encryption as a key frontier. The systematic construction of 336 programs across 12 levels and the focus on automated, large-scale testing are strengths that could support follow-on research in AI-assisted security analysis.
major comments (2)
- [Automated Evaluation Framework] The automated evaluation framework (abstract and methods) is load-bearing for the dichotomy claim, yet the manuscript provides no details on how free-form LLM responses are parsed into binary IoC recovery success (e.g., exact string match, regex for IP formats, or keyword presence versus semantic correctness). On encryption levels, responses often include explanations of inability to decrypt AES-256 without keys or partial attempts; simplistic parsing risks systematic undercounting here while overcounting on Base64 or renaming, exaggerating the degradation effect.
- [Results] The results section reports high-level success rates but omits exact success metrics, statistical tests, error bars, specific models tested, or controls for prompt sensitivity and query variations. Without these, the cross-level comparison and the claim that encryption 'severely degrades' performance cannot be assessed for robustness.
minor comments (2)
- [Abstract] The abstract states the dataset emulates 'realistic threat scenarios' but does not justify how the 12 levels map to actual adversary practices; a brief table or reference to threat reports would strengthen this.
- The paper should explicitly state whether the dataset, transformation scripts, and evaluation code are publicly released to enable independent verification of the automated framework.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our benchmark paper. The comments highlight important areas for improving reproducibility and rigor, and we have revised the manuscript to address them directly.
- Referee: [Automated Evaluation Framework] The automated evaluation framework (abstract and methods) is load-bearing for the dichotomy claim, yet the manuscript provides no details on how free-form LLM responses are parsed into binary IoC recovery success (e.g., exact string match, regex for IP formats, or keyword presence versus semantic correctness). On encryption levels, responses often include explanations of inability to decrypt AES-256 without keys or partial attempts; simplistic parsing risks systematic undercounting here while overcounting on Base64 or renaming, exaggerating the degradation effect.
Authors: We agree that the parsing logic was insufficiently detailed and could introduce bias. In the revised manuscript we have expanded the Methods section (now subsection 3.3) with a full description of the automated parser: it applies regex patterns for standard IoC formats (IPv4, domain names, URLs) followed by exact string matching against the ground-truth IoCs; any response that contains only explanatory text about decryption failure or inability to recover the value is scored as unsuccessful. We have added pseudocode and representative examples from each concealment level in a new appendix to show that the observed performance drop under encryption is not an artifact of overly lenient or strict parsing. revision: yes
- Referee: [Results] The results section reports high-level success rates but omits exact success metrics, statistical tests, error bars, specific models tested, or controls for prompt sensitivity and query variations. Without these, the cross-level comparison and the claim that encryption 'severely degrades' performance cannot be assessed for robustness.
Authors: We accept that the original results presentation was too high-level. The revised manuscript now contains a new Table 2 reporting exact per-model, per-level success percentages together with standard deviations computed over five independent runs that vary temperature (0.0, 0.5, 1.0) and minor prompt phrasing. The models evaluated are explicitly listed as GPT-4o, Claude-3-Opus, and Gemini-1.5-Pro. We added Wilcoxon signed-rank tests comparing lightweight versus encryption-based levels (all p < 0.01) and a short sensitivity analysis confirming that the relative collapse under encryption persists across prompt variations. revision: yes
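The parsing rule described in the first response (regex extraction of candidate IoCs, then exact matching against the ground truth) and the per-run aggregation described in the second can be sketched as follows. This is an illustrative reading of the rebuttal text, not the authors' parser: the regex patterns, the function names, and the choice of sample standard deviation are assumptions, and the sketch does not strip trailing punctuation from URLs.

```python
import re
from statistics import mean, stdev

OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Assumed patterns for the three IoC formats named in the rebuttal.
PATTERNS = {
    "ipv4": re.compile(rf"(?<!\d)(?<!\d\.){OCTET}(?:\.{OCTET}){{3}}(?!\d)(?!\.\d)"),
    "url": re.compile(r"https?://[^\s\"'<>)]+", re.IGNORECASE),
    "domain": re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE),
}


def extract_candidates(response: str) -> set[str]:
    """Step 1: pull every IoC-shaped token out of a free-form response."""
    candidates = set()
    for pattern in PATTERNS.values():
        candidates.update(m.group(0) for m in pattern.finditer(response))
    return candidates


def is_recovered(response: str, ground_truth: str) -> bool:
    """Step 2: success only if a candidate equals the ground truth exactly.

    A response containing only an explanation of why decryption failed
    yields no matching candidate and is therefore scored as unsuccessful.
    """
    return ground_truth in extract_candidates(response)


def success_rate(results: list[tuple[str, str]]) -> float:
    """Fraction of (response, ground_truth) pairs scored as recovered."""
    if not results:
        raise ValueError("no results to score")
    return sum(is_recovered(r, t) for r, t in results) / len(results)


def summarize(run_rates: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over independent runs (needs >= 2)."""
    return mean(run_rates), stdev(run_rates)
```

Matching on whole extracted tokens rather than on substrings matters here: a response that guesses `192.0.2.10` should not count as recovering `192.0.2.1`, which a naive `in` check on the raw text would allow.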
Circularity Check
No circularity: purely empirical benchmarking with direct measurements
The paper constructs a dataset of 336 programs across 12 obfuscation levels and evaluates LLMs via an automated framework that standardizes queries and parses responses for IoC recovery success. No derivations, equations, fitted parameters, or predictions appear; results are obtained by direct execution on the dataset. No load-bearing self-citations or uniqueness theorems are invoked. The central dichotomy claim rests on observed success rates rather than any reduction to inputs by construction. This matches the default expectation for non-circular empirical studies.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our results reveal a dichotomy: while LLMs exhibit high success against lightweight transformations such as variable renaming and Base64 encoding, encryption-based concealment severely degrades detection performance."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We construct a dataset of 336 programs, progressively transformed through 12 levels of obfuscation and cryptographic concealment (including XOR and AES-256)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.