CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering
Pith reviewed 2026-05-13 17:25 UTC · model grok-4.3
The pith
Large language models achieve up to 64 points on a new benchmark for cryptographic binary reverse engineering, while human experts score 92.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CREBench is a benchmark of 432 CTF challenges derived from 48 standard cryptographic algorithms, three insecure key-usage scenarios, and three difficulty levels. When evaluated on four sub-tasks from algorithm identification to flag recovery, frontier LLMs reach a maximum of 64.03 out of 100 points, with GPT-5.4 recovering the flag in 59% of challenges, while human experts achieve 92.19 points.
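The headline count of 432 is consistent with a full cross product of the three dimensions listed in the abstract (48 algorithms, 3 key-usage scenarios, 3 difficulty levels). The short sketch below only checks that arithmetic; whether the benchmark was literally built as a complete grid is an assumption, and the labels `alg_00`, `scenario_1`, and `level_1` are placeholders, not names from the dataset.

```python
from itertools import product

# Placeholder labels; the real algorithm, scenario, and level names are not given here.
algorithms = [f"alg_{i:02d}" for i in range(48)]
scenarios = [f"scenario_{i}" for i in range(1, 4)]
difficulties = [f"level_{i}" for i in range(1, 4)]

# Assumption: one challenge per (algorithm, scenario, difficulty) combination.
challenges = list(product(algorithms, scenarios, difficulties))

if __name__ == "__main__":
    print(len(challenges))  # 48 * 3 * 3
```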
What carries the argument
CREBench, a benchmark suite of 432 challenges that requires models to identify the cryptographic algorithm, understand the logic, and recover the input flag from binary code.
If this is right
- LLMs can partially automate cryptographic reverse engineering but lag behind humans in full flag recovery.
- The four sub-tasks reveal specific strengths and weaknesses in LLM performance on algorithm identification and logic analysis.
- Including insecure key-usage scenarios tests the models' ability to detect common vulnerabilities in crypto implementations.
- Performance gaps suggest that current LLMs are not yet ready to fully replace expert analysis in security tasks.
- The benchmark provides a standardized way to track improvements in LLM capabilities for binary RE.
Where Pith is reading between the lines
- Improved models trained specifically on binary disassembly data could narrow the performance gap to human levels.
- CREBench could be adapted to evaluate LLMs in non-cryptographic reverse engineering domains such as general malware analysis.
- If LLMs reach human-level performance, it would significantly reduce the time required for vulnerability discovery in cryptographic software.
- The results imply that hybrid human-AI workflows may be the immediate practical application for crypto RE tasks.
Load-bearing premise
The 432 challenges accurately represent the distribution and difficulty of real-world cryptographic binary reverse engineering without introducing unintended biases.
What would settle it
Either a new LLM scoring above 90 points on CREBench, or empirical evidence that real-world crypto RE problems differ substantially in structure from those in the benchmark, would falsify the assessment of current LLM limitations.
Original abstract
Reverse engineering (RE) is central to software security, particularly for cryptographic programs that handle sensitive data and are highly prone to vulnerabilities. It supports critical tasks such as vulnerability discovery and malware analysis. Despite its importance, RE remains labor-intensive and requires substantial expertise, making large language models (LLMs) a potential solution for automating the process. However, their capabilities for RE remain systematically underexplored. To address this gap, we study the cryptographic binary RE capabilities of LLMs and introduce CREBench, a benchmark comprising 432 challenges built from 48 standard cryptographic algorithms, 3 insecure crypto key usage scenarios, and 3 difficulty levels. Each challenge follows a Capture-the-Flag (CTF) RE challenge, requiring the model to analyze the underlying cryptographic logic and recover the correct input. We design an evaluation framework comprising four sub-tasks, from algorithm identification to correct flag recovery. We evaluate eight frontier LLMs on CREBench. GPT-5.4, the best-performing model, achieves 64.03 out of 100 and recovers the flag in 59% of challenges. We also establish a strong human expert baseline of 92.19 points, showing that humans maintain an advantage in cryptographic RE tasks. Our code and dataset are available at https://github.com/wangyu-ovo/CREBench.
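To make the challenge format concrete, here is a minimal Python sketch of a CTF-style checker and the matching solve step. It is an illustration only. The repeating-key XOR is a deliberately simple stand-in for a real cipher, and `KEY`, `TARGET`, `check_flag`, and `recover_flag` are hypothetical names that do not come from CREBench, whose challenges are compiled binaries rather than Python scripts.

```python
# Toy stand-in for a CTF-style crypto RE challenge (not CREBench code).
# A real challenge would use a standard cipher inside a compiled binary;
# repeating-key XOR is used here only to keep the sketch dependency-free.

KEY = bytes.fromhex("13572468")  # hypothetical key embedded in the program


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Repeating-key XOR; encryption and decryption are the same operation."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# In a real challenge only this ciphertext would be stored, never the flag.
TARGET = xor_cipher(b"flag{toy_example}", KEY)


def check_flag(candidate: bytes) -> bool:
    """What the challenge program does: encrypt the input, compare to TARGET."""
    return xor_cipher(candidate, KEY) == TARGET


def recover_flag(key: bytes, target: bytes) -> bytes:
    """What a solver does after extracting key and target: invert the cipher."""
    return xor_cipher(target, key)


if __name__ == "__main__":
    flag = recover_flag(KEY, TARGET)
    print(flag, check_flag(flag))
```

The point of the sketch is the division of labor: the checker never reveals the flag, so the solver has to identify the cipher, locate the key and target, and run the inverse operation.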
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CREBench, a benchmark of 432 CTF-style challenges for evaluating LLMs on cryptographic binary reverse engineering. Challenges are constructed from 48 standard algorithms, three insecure key-usage scenarios, and three difficulty levels, with an evaluation framework of four sub-tasks progressing from algorithm identification to flag recovery. Frontier LLMs are tested, with GPT-5.4 achieving the highest score of 64.03/100 and recovering the flag in 59% of cases; a human expert baseline scores 92.19. Code and dataset are released publicly.
Significance. If the benchmark construction ensures that success requires genuine reverse-engineering analysis rather than recall of standard implementations, the results would usefully quantify the gap between current LLMs and human experts in a security-critical domain and supply a reproducible testbed for future work. The public release of the dataset and code is a clear strength that supports reproducibility.
major comments (3)
- [§3] §3 (Benchmark Construction): The paper states that challenges are built from 48 standard algorithms and three insecure key-usage scenarios but provides no concrete description of binary presentation (raw bytes, disassembly, or decompiled source), obfuscation techniques applied, or how the insecure usage is embedded in the binary. Without these details the 59% flag-recovery rate for GPT-5.4 cannot be confidently attributed to reverse-engineering capability rather than memorization of textbook crypto flows.
- [§4] §4 (Evaluation Framework): The four sub-tasks are described only at a high level; it is unclear whether they are scored independently or chained, how partial credit is assigned, and exactly how the composite 100-point scale is computed. This ambiguity directly affects interpretation of the headline 64.03 score and the cross-model comparison.
- [§5] §5 (Results): The human-expert baseline of 92.19 is reported without stating the number of experts, their experience level, time limits, or whether they received the same binary artifacts as the models. This information is required to assess whether the human-LLM gap is measured on comparable inputs.
minor comments (2)
- [Abstract / §1] The abstract and §1 refer to “GPT-5.4” without clarifying whether this is a hypothetical or released model; a footnote or citation would remove ambiguity.
- [Results tables] Table 1 (or equivalent results table) should include per-sub-task breakdowns for all eight models so readers can see where the performance gap is largest.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have revised the paper to address each of the major comments by adding the requested clarifications and details. Our point-by-point responses follow.
- Referee: [§3] §3 (Benchmark Construction): The paper states that challenges are built from 48 standard algorithms and three insecure key-usage scenarios but provides no concrete description of binary presentation (raw bytes, disassembly, or decompiled source), obfuscation techniques applied, or how the insecure usage is embedded in the binary. Without these details the 59% flag-recovery rate for GPT-5.4 cannot be confidently attributed to reverse-engineering capability rather than memorization of textbook crypto flows.
  Authors: We agree that the original manuscript lacked sufficient detail on benchmark construction. In the revised version we have expanded §3 with a new subsection that specifies: binaries are presented as both raw ELF files and corresponding decompiled C source (generated via Ghidra); no additional obfuscation passes were applied beyond the insecure key-usage patterns themselves; and insecure usages are embedded via three explicit patterns (hard-coded keys in global variables, keys passed in plaintext function arguments, and keys stored in stack buffers without clearing). These concrete descriptions make clear that success on the benchmark requires identifying the specific insecure pattern rather than recalling a generic textbook implementation, as each challenge varies the surrounding code structure.
  revision: yes
- Referee: [§4] §4 (Evaluation Framework): The four sub-tasks are described only at a high level; it is unclear whether they are scored independently or chained, how partial credit is assigned, and exactly how the composite 100-point scale is computed. This ambiguity directly affects interpretation of the headline 64.03 score and the cross-model comparison.
  Authors: We acknowledge the original description was insufficiently precise. Section 4 has been revised to state that the four sub-tasks are chained: a model receives credit for a later sub-task only after correctly completing the preceding ones, mirroring a realistic RE workflow, and each completed sub-task contributes its own points to the total. The composite score is a weighted sum: 15 points for algorithm identification, 25 points for key identification, 30 points for insecure-usage scenario detection, and 30 points for flag recovery. A new table (Table 2) now provides the exact rubric, point allocations, and examples of how partial credit is awarded for incomplete but directionally correct answers.
  revision: yes
- Referee: [§5] §5 (Results): The human-expert baseline of 92.19 is reported without stating the number of experts, their experience level, time limits, or whether they received the same binary artifacts as the models. This information is required to assess whether the human-LLM gap is measured on comparable inputs.
  Authors: We have added the missing details to §5. The human baseline was collected from three experts, each possessing more than seven years of professional experience in binary reverse engineering and cryptographic protocol analysis. All experts received exactly the same binary artifacts (both raw bytes and decompiled source) provided to the LLMs and were subject to a 45-minute time limit per challenge. We also report that the experts worked independently and that their average score of 92.19 reflects this controlled setting.
  revision: yes
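The chained, weighted scoring described in the simulated response to the §4 comment can be sketched as follows. The weights (15, 25, 30, 30) are taken from that response. Everything else is an assumption made for illustration: the identifiers, the all-or-nothing treatment of each sub-task (the rubric for "directionally correct" partial credit is not given here), and the averaging over challenges in `benchmark_score`.

```python
# Sketch of chained, weighted scoring; not the paper's evaluation code.
# Weights follow the simulated rebuttal; all names are hypothetical.

SUBTASKS = [
    ("algorithm_identification", 15),
    ("key_identification", 25),
    ("insecure_usage_detection", 30),
    ("flag_recovery", 30),
]


def chained_score(passed: dict) -> int:
    """Sum sub-task points in order, stopping at the first failed sub-task."""
    total = 0
    for name, points in SUBTASKS:
        if not passed.get(name, False):
            break  # chaining: later sub-tasks earn nothing after a failure
        total += points
    return total


def benchmark_score(per_challenge: list) -> float:
    """Assumed aggregation: mean per-challenge score on the 0-100 scale."""
    if not per_challenge:
        return 0.0
    return sum(chained_score(p) for p in per_challenge) / len(per_challenge)


if __name__ == "__main__":
    solved = {name: True for name, _ in SUBTASKS}
    stuck_after_key = {"algorithm_identification": True, "key_identification": True}
    print(chained_score(solved), chained_score(stuck_after_key))
    print(benchmark_score([solved, stuck_after_key]))
```

Under this reading, a model that recovers a flag without first identifying the algorithm would score zero on that challenge, which is one reason the exact chaining rule matters for interpreting the 64.03 headline number.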
Circularity Check
No circularity: direct empirical benchmark evaluation with no derivations or self-referential predictions
The paper introduces CREBench as an empirical test set of 432 CTF-style challenges built from 48 standard algorithms and evaluates eight LLMs plus a human baseline on four sub-tasks ending in flag recovery. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The reported scores (e.g., GPT-5.4 at 64.03/100 and 59% flag recovery) are direct measurements against the new benchmark rather than reductions of any claimed derivation to its own construction. The work is therefore self-contained as an evaluation study.