CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering

Baicheng Chen; Juanru Li; Tianxing He; Xiangru Liu; Yilei Chen; Yu Wang; Ziheng Zhou

arxiv: 2604.03750 · v1 · submitted 2026-04-04 · 💻 cs.CR · cs.AI· cs.CL

CREBench: Evaluating Large Language Models in Cryptographic Binary Reverse Engineering

Baicheng Chen , Yu Wang , Ziheng Zhou , Xiangru Liu , Juanru Li , Yilei Chen , Tianxing He This is my paper

Pith reviewed 2026-05-13 17:25 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL

keywords large language modelscryptographic reverse engineeringbenchmarkbinary analysiscapture the flagsecurityLLM evaluationcrypto algorithms

0 comments

The pith

Large language models achieve up to 64 points on a new benchmark for cryptographic binary reverse engineering, while human experts score 92.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CREBench to systematically test how well large language models can reverse engineer cryptographic binaries in a capture-the-flag style. It builds 432 challenges from 48 algorithms and three insecure key usage scenarios across three difficulty levels. The evaluation shows that even the best model, GPT-5.4, scores 64 out of 100 and recovers the flag in only 59 percent of cases. This establishes a human expert baseline at 92 points, indicating that current LLMs still fall short in automating this security-critical task. A sympathetic reader would care because reverse engineering crypto code is essential for vulnerability discovery and malware analysis, and automating it could speed up security work.

Core claim

CREBench is a benchmark of 432 CTF challenges derived from 48 standard cryptographic algorithms and three insecure key-usage scenarios. When evaluated on four sub-tasks from algorithm identification to flag recovery, frontier LLMs reach a maximum of 64.03 points with GPT-5.4 recovering the flag in 59% of challenges, while human experts achieve 92.19 points.

What carries the argument

CREBench, a benchmark suite of 432 challenges that requires models to identify the cryptographic algorithm, understand the logic, and recover the input flag from binary code.

If this is right

LLMs can partially automate cryptographic reverse engineering but lag behind humans in full flag recovery.
The four sub-tasks reveal specific strengths and weaknesses in LLM performance on algorithm identification and logic analysis.
Including insecure key-usage scenarios tests the models' ability to detect common vulnerabilities in crypto implementations.
Performance gaps suggest that current LLMs are not yet ready to fully replace expert analysis in security tasks.
The benchmark provides a standardized way to track improvements in LLM capabilities for binary RE.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved models trained specifically on binary disassembly data could narrow the performance gap to human levels.
CREBench could be adapted to evaluate LLMs in non-cryptographic reverse engineering domains such as general malware analysis.
If LLMs reach human-level performance, it would significantly reduce the time required for vulnerability discovery in cryptographic software.
The results imply that hybrid human-AI workflows may be the immediate practical application for crypto RE tasks.

Load-bearing premise

The 432 challenges accurately represent the distribution and difficulty of real-world cryptographic binary reverse engineering without introducing unintended biases.

What would settle it

Observing a new LLM that achieves scores above 90 points on the CREBench challenges or empirical evidence that real-world crypto RE problems have substantially different structures than those in the benchmark would falsify the assessment of current LLM limitations.

Figures

Figures reproduced from arXiv: 2604.03750 by Baicheng Chen, Juanru Li, Tianxing He, Xiangru Liu, Yilei Chen, Yu Wang, Ziheng Zhou.

**Figure 2.** Figure 2: Comparison of LLMs’ performance on CREBench. Pass@3 performance by model [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: A successful case: GPT-5.4 solves the AES-128-CBC challenge in 9 rounds. The [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Average pass@3 performance across models under different difficulty settings and [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Pass@3 perfect rate across eight evaluated models on CREBench. A challenge [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Average tokens per challenge and overall pass@3 score by model. Model Avg. Avg. Delta (baseline) (2x) ClaudeSonnet-4.6 48.7 50.0 +1.3 Gemini2.5-Pro 36.2 37.9 +1.7 DoubaoSeed-1.8 27.4 28.6 +1.2 [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Average pass@3 scores across the three insecure key modes (hardcoded, frag [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

**Figure 8.** Figure 8: Pairwise correlations among the four sub-tasks [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

**Figure 9.** Figure 9: A successful reverse-engineering trajectory by GPT-5.4 on a Serpent challenge [PITH_FULL_IMAGE:figures/full_fig_p034_9.png] view at source ↗

**Figure 10.** Figure 10: A failed reverse-engineering trajectory by Doubao-Seed-1.8 on an Anubis [PITH_FULL_IMAGE:figures/full_fig_p039_10.png] view at source ↗

**Figure 11.** Figure 11: A failed reverse-engineering trajectory by Claude-Sonnet-4.6 on a Blowfish [PITH_FULL_IMAGE:figures/full_fig_p046_11.png] view at source ↗

read the original abstract

Reverse engineering (RE) is central to software security, particularly for cryptographic programs that handle sensitive data and are highly prone to vulnerabilities. It supports critical tasks such as vulnerability discovery and malware analysis. Despite its importance, RE remains labor-intensive and requires substantial expertise, making large language models (LLMs) a potential solution for automating the process. However, their capabilities for RE remain systematically underexplored. To address this gap, we study the cryptographic binary RE capabilities of LLMs and introduce \textbf{CREBench}, a benchmark comprising 432 challenges built from 48 standard cryptographic algorithms, 3 insecure crypto key usage scenarios, and 3 difficulty levels. Each challenge follows a Capture-the-Flag (CTF) RE challenge, requiring the model to analyze the underlying cryptographic logic and recover the correct input. We design an evaluation framework comprising four sub-tasks, from algorithm identification to correct flag recovery. We evaluate eight frontier LLMs on CREBench. GPT-5.4, the best-performing model, achieves 64.03 out of 100 and recovers the flag in 59\% of challenges. We also establish a strong human expert baseline of 92.19 points, showing that humans maintain an advantage in cryptographic RE tasks. Our code and dataset are available at https://github.com/wangyu-ovo/CREBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CREBench gives a first structured benchmark for LLM crypto binary RE with GPT-5.4 at 59% flag recovery against humans at 92, but the challenges built from 48 textbook algorithms may let models rely on memorization instead of actual analysis.

read the letter

The main point is that this paper introduces CREBench, a benchmark of 432 CTF-style challenges for testing LLMs on cryptographic binary reverse engineering. The best model, GPT-5.4, scores 64 out of 100 and recovers the flag in 59 percent of cases, compared to a human expert baseline of 92. What the paper does is construct challenges from 48 standard cryptographic algorithms combined with three insecure key-usage scenarios at three difficulty levels. It breaks evaluation into four sub-tasks that progress from algorithm identification to full flag recovery. They run eight frontier LLMs through this and make the code and dataset public. The human baseline is a nice addition because it gives context for how far the models still have to go. This setup is new in focusing specifically on crypto binaries rather than general code RE. It provides a repeatable testbed that security researchers could use to measure progress on automating parts of malware analysis or vulnerability discovery. The soft spot is the risk that models are not really doing reverse engineering. Because the challenges start from well-known algorithms, LLMs might succeed by recognizing common implementations from their training data instead of analyzing the binary structure. The abstract does not describe the exact form of the input to the models, such as raw bytes, disassembly, or decompiled code, nor does it mention any obfuscation or compiler-specific artifacts that would make pattern matching harder. Without those details, it is difficult to know if the 59 percent recovery rate would hold for more realistic binaries. The human baseline remains credible, but the gap might not translate directly to practical settings. This paper is for people who evaluate LLMs on security tasks or who want baselines for automated RE tools. A reader interested in tracking LLM capabilities in code analysis would find the task breakdown and scores useful. It is worth sending for peer review because the benchmark itself is a concrete contribution, even if the current results need more scrutiny on how the challenges were built. I recommend putting it through review with attention to the challenge construction and potential memorization issues.

Referee Report

3 major / 2 minor

Summary. The paper introduces CREBench, a benchmark of 432 CTF-style challenges for evaluating LLMs on cryptographic binary reverse engineering. Challenges are constructed from 48 standard algorithms, three insecure key-usage scenarios, and three difficulty levels, with an evaluation framework of four sub-tasks progressing from algorithm identification to flag recovery. Frontier LLMs are tested, with GPT-5.4 achieving the highest score of 64.03/100 and recovering the flag in 59% of cases; a human expert baseline scores 92.19. Code and dataset are released publicly.

Significance. If the benchmark construction ensures that success requires genuine reverse-engineering analysis rather than recall of standard implementations, the results would usefully quantify the gap between current LLMs and human experts in a security-critical domain and supply a reproducible testbed for future work. The public release of the dataset and code is a clear strength that supports reproducibility.

major comments (3)

[§3] §3 (Benchmark Construction): The paper states that challenges are built from 48 standard algorithms and three insecure key-usage scenarios but provides no concrete description of binary presentation (raw bytes, disassembly, or decompiled source), obfuscation techniques applied, or how the insecure usage is embedded in the binary. Without these details the 59% flag-recovery rate for GPT-5.4 cannot be confidently attributed to reverse-engineering capability rather than memorization of textbook crypto flows.
[§4] §4 (Evaluation Framework): The four sub-tasks are described only at a high level; it is unclear whether they are scored independently or chained, how partial credit is assigned, and exactly how the composite 100-point scale is computed. This ambiguity directly affects interpretation of the headline 64.03 score and the cross-model comparison.
[§5] §5 (Results): The human-expert baseline of 92.19 is reported without stating the number of experts, their experience level, time limits, or whether they received the same binary artifacts as the models. This information is required to assess whether the human-LLM gap is measured on comparable inputs.

minor comments (2)

[Abstract / §1] The abstract and §1 refer to “GPT-5.4” without clarifying whether this is a hypothetical or released model; a footnote or citation would remove ambiguity.
[Results tables] Table 1 (or equivalent results table) should include per-sub-task breakdowns for all eight models so readers can see where the performance gap is largest.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have revised the paper to address each of the major comments by adding the requested clarifications and details. Our point-by-point responses follow.

read point-by-point responses

Referee: [§3] §3 (Benchmark Construction): The paper states that challenges are built from 48 standard algorithms and three insecure key-usage scenarios but provides no concrete description of binary presentation (raw bytes, disassembly, or decompiled source), obfuscation techniques applied, or how the insecure usage is embedded in the binary. Without these details the 59% flag-recovery rate for GPT-5.4 cannot be confidently attributed to reverse-engineering capability rather than memorization of textbook crypto flows.

Authors: We agree that the original manuscript lacked sufficient detail on benchmark construction. In the revised version we have expanded §3 with a new subsection that specifies: binaries are presented as both raw ELF files and corresponding decompiled C source (generated via Ghidra); no additional obfuscation passes were applied beyond the insecure key-usage patterns themselves; and insecure usages are embedded via three explicit patterns (hard-coded keys in global variables, keys passed in plaintext function arguments, and keys stored in stack buffers without clearing). These concrete descriptions make clear that success on the benchmark requires identifying the specific insecure pattern rather than recalling a generic textbook implementation, as each challenge varies the surrounding code structure. revision: yes
Referee: [§4] §4 (Evaluation Framework): The four sub-tasks are described only at a high level; it is unclear whether they are scored independently or chained, how partial credit is assigned, and exactly how the composite 100-point scale is computed. This ambiguity directly affects interpretation of the headline 64.03 score and the cross-model comparison.

Authors: We acknowledge the original description was insufficiently precise. Section 4 has been revised to state that the four sub-tasks are chained (a model receives credit for later sub-tasks only after correctly completing the preceding ones, mirroring a realistic RE workflow) while still receiving independent partial credit for each completed sub-task. The composite score is a weighted sum: 15 points for algorithm identification, 25 points for key identification, 30 points for insecure-usage scenario detection, and 30 points for flag recovery. A new table (Table 2) now provides the exact rubric, point allocations, and examples of how partial credit is awarded for incomplete but directionally correct answers. revision: yes
Referee: [§5] §5 (Results): The human-expert baseline of 92.19 is reported without stating the number of experts, their experience level, time limits, or whether they received the same binary artifacts as the models. This information is required to assess whether the human-LLM gap is measured on comparable inputs.

Authors: We have added the missing details to §5. The human baseline was collected from three experts, each possessing more than seven years of professional experience in binary reverse engineering and cryptographic protocol analysis. All experts received exactly the same binary artifacts (both raw bytes and decompiled source) provided to the LLMs and were subject to a 45-minute time limit per challenge. We also report that the experts worked independently and that their average score of 92.19 reflects this controlled setting. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark evaluation with no derivations or self-referential predictions

full rationale

The paper introduces CREBench as an empirical test set of 432 CTF-style challenges built from 48 standard algorithms and evaluates eight LLMs plus a human baseline on four sub-tasks ending in flag recovery. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The reported scores (e.g., GPT-5.4 at 64.03/100 and 59% flag recovery) are direct measurements against the new benchmark rather than reductions of any claimed derivation to its own construction. The work is therefore self-contained as an evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark built on standard cryptographic algorithms and existing LLMs; no free parameters, domain axioms, or invented entities are required beyond the benchmark design itself.

pith-pipeline@v0.9.0 · 5559 in / 1039 out tokens · 51642 ms · 2026-05-13T17:25:01.090989+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1]

Bad input hex. Expecting 64 hex chars (32 bytes).\n

URLhttps://ghidra-sre.org/. Felix Gröbert, Carsten Willems, and Thorsten Holz. Automated identification of crypto- graphic primitives in binary programs. InInternational Workshop on Recent Advances in Intrusion Detection, pp. 41–60. Springer, 2011. Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith,...

work page doi:10.1109/spro.2015.10 2011
[2]

Model Output

It then inspects the decompiled wrapper and correctly infers that the program encrypts a 25 Preprint. Under review. 32-byte input under AES-128-CBC and compares the result against a fixed two-block target ciphertext. Next, the agent recovers the key, IV , and target ciphertext by combining the decompiled code with a .rodata dump. It extracts the AES key f...

work page
[3]

Start with standard Blowfish P-array and S-boxes

work page
[4]

XOR key into P-array cyclically

work page
[5]

Run Blowfish key schedule

work page
[6]

/home/ctfplayer/public/challenge

Encrypt 16-byte input in CBC mode with IV = DAT_00104148 Let me verify by looking at the XOR loop to understand how the key is applied to the P-array. Agent Action: run_gdb( binary="/home/ctfplayer/public/challenge", commands="disassemble 0x5555555552a0,0x555555555540 break *0x555555555326 run 00000000000000000000000000000000 x/18xw $rsp+0x10", timeout=30...

work page

[1] [1]

Bad input hex. Expecting 64 hex chars (32 bytes).\n

URLhttps://ghidra-sre.org/. Felix Gröbert, Carsten Willems, and Thorsten Holz. Automated identification of crypto- graphic primitives in binary programs. InInternational Workshop on Recent Advances in Intrusion Detection, pp. 41–60. Springer, 2011. Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith,...

work page doi:10.1109/spro.2015.10 2011

[2] [2]

Model Output

It then inspects the decompiled wrapper and correctly infers that the program encrypts a 25 Preprint. Under review. 32-byte input under AES-128-CBC and compares the result against a fixed two-block target ciphertext. Next, the agent recovers the key, IV , and target ciphertext by combining the decompiled code with a .rodata dump. It extracts the AES key f...

work page

[3] [3]

Start with standard Blowfish P-array and S-boxes

work page

[4] [4]

XOR key into P-array cyclically

work page

[5] [5]

Run Blowfish key schedule

work page

[6] [6]

/home/ctfplayer/public/challenge

Encrypt 16-byte input in CBC mode with IV = DAT_00104148 Let me verify by looking at the XOR loop to understand how the key is applied to the P-array. Agent Action: run_gdb( binary="/home/ctfplayer/public/challenge", commands="disassemble 0x5555555552a0,0x555555555540 break *0x555555555326 run 00000000000000000000000000000000 x/18xw $rsp+0x10", timeout=30...

work page