arxiv: 2605.29490 · v1 · pith:MOFYUQTMnew · submitted 2026-05-28 · 💻 cs.SE · cs.CR

CODEFUSE-DEBENCH: An Empirical Study on Readability, Recompilability, and Functionality

Pith reviewed 2026-06-29 06:52 UTC · model grok-4.3

classification 💻 cs.SE cs.CR
keywords: decompilation, reusability, readability, recompilability, functionality, decompilers, LLM repair, dynamic tracing

The pith

The best decompiler-LLM pair reaches 22.3% behavioral overlap but only 1.2% exact stdout match on test binaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that decompilation must be measured by practical reusability across three dimensions rather than syntactic similarity alone. It builds an automated benchmark with 240 functions turned into 640 binaries and applies LLM judging for readability, compile-repair loops for recompilability, and dynamic tracing for functionality. Results show a steep performance drop from recompilation rates to actual behavioral equivalence, with compiler and optimization choices creating larger gaps than the choice of repair model. This matters to a reader because it quantifies why decompiled code often remains unusable in practice despite appearing close to source. The work also classifies failure types into syntactic noise, type collapse, and permanent upstream losses.

Core claim

DEBENCH shows that the reusability cliff is steep: the best decompiler-LLM pair reaches 22.3% Exact+Partial program-level behavioral overlap but only 1.2% exact stdout match, nearly 50 points below recompilability. Settings that maximize readability do not maximize functionality: -O3 yields the lowest readability but the highest functionality, and Clang produces 2.6x higher functionality than GCC despite lower readability. Cross-decompiler variation at the functional level is 20x, far larger than the 1.6x cross-LLM variation. Repair failures split into syntactic noise, type-system collapse (about 19% of repair errors), and irreversible upstream losses from ARM64 relocation idioms and C++ ABI features.
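
To make the type-system-collapse category concrete, the following is a minimal sketch of how compiler diagnostics from a repair log could be bucketed into type-related and other classes. The category names follow those listed in the paper's Finding 2; the regular expressions target common GCC/Clang message wording and are assumptions of this sketch, not the paper's actual mining rules.

```python
import re
from collections import Counter

# Category names come from the paper's Finding 2.
# The regexes are hypothetical: a guess at typical GCC/Clang phrasing.
TYPE_PATTERNS = {
    "Conflicting Types": re.compile(r"conflicting types for"),
    "Incompatible Pointer Type": re.compile(r"incompatible pointer type"),
    "Unknown Type": re.compile(r"unknown type name"),
    "Incomplete Type": re.compile(r"incomplete type"),
    "Member Access Error": re.compile(r"no member named"),
    "Void Value Error": re.compile(r"void value not ignored"),
}


def categorize(diagnostic: str) -> str:
    """Return the first type-related category that matches, else 'Other'."""
    for name, pattern in TYPE_PATTERNS.items():
        if pattern.search(diagnostic):
            return name
    return "Other"


def type_related_share(diagnostics):
    """Fraction of diagnostics in a type-related category, plus raw counts."""
    counts = Counter(categorize(d) for d in diagnostics)
    total = sum(counts.values())
    type_related = total - counts.get("Other", 0)
    return (type_related / total if total else 0.0), counts
```

Running `type_related_share` over a full repair log would give a share comparable in spirit to the reported figure of about 19%, though the exact value depends on the matching rules used.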

What carries the argument

DEBENCH, an automated framework that evaluates five decompilers and three repair LLMs across readability via LLM-as-judge plus 18-subdimension URAF scoring, recompilability via fixed 50-iteration compile-and-repair, and functionality via Frida differential tracing at program, function, and instruction levels on 240 atomic test functions.
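
The recompilability stage can be pictured as a bounded loop. The sketch below assumes two hypothetical callables, `try_compile` and `repair`, standing in for the compiler invocation and the repair-LLM call; neither name comes from the paper. It also assumes the 50-iteration budget counts repair calls, which the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

MAX_ITERATIONS = 50  # fixed budget stated in the paper


@dataclass
class RepairResult:
    success: bool
    iterations: int  # number of repair calls made
    source: str      # last source text that was attempted


def compile_and_repair(
    source: str,
    try_compile: Callable[[str], Tuple[bool, str]],  # hypothetical: (ok, diagnostics)
    repair: Callable[[str, str], str],               # hypothetical: LLM repair step
    budget: int = MAX_ITERATIONS,
) -> RepairResult:
    """Compile, and on failure ask the repair model for a new version,
    stopping at the first success or when the budget is exhausted."""
    for used in range(budget + 1):
        ok, diagnostics = try_compile(source)
        if ok:
            return RepairResult(True, used, source)
        if used == budget:
            break  # budget spent, do not request another repair
        source = repair(source, diagnostics)
    return RepairResult(False, budget, source)
```

Passing the compiler and the repair model in as callables keeps the loop testable without a real toolchain or model endpoint.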

If this is right

  • Decompiler engines drive functional outcomes far more than repair LLMs, since cross-decompiler variation is 20x while cross-LLM variation is only 1.6x.
  • Optimization level and compiler choice create direct trade-offs, with -O3 delivering highest functionality at the cost of lowest readability and Clang outperforming GCC by 2.6x on functionality.
  • Roughly 19% of repair errors come from type-system collapse, pointing to a recurring limit in current repair pipelines.
  • Certain upstream losses from ARM64 relocation idioms and C++ ABI features cannot be recovered by repair models.
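
The 2.6x compiler gap is consistent with the program-level Exact+Partial rates the paper reports for the two compilers (26.5% for Clang, 10.2% for GCC). A minimal worked check follows; the helper name is illustrative only.

```python
def fold_difference(higher_pct: float, lower_pct: float) -> float:
    """Ratio between two rates that are both expressed in percent."""
    if lower_pct <= 0:
        raise ValueError("lower rate must be positive")
    return higher_pct / lower_pct


clang_exact_partial = 26.5  # % Exact+Partial, as reported in the paper
gcc_exact_partial = 10.2    # % Exact+Partial, as reported in the paper

ratio = fold_difference(clang_exact_partial, gcc_exact_partial)
print(f"Clang vs GCC: {ratio:.1f}x")  # prints "Clang vs GCC: 2.6x"
```

Note that a fold difference and a percentage-point difference answer different questions: the same two rates are 2.6x apart as a ratio but about 16 points apart in absolute terms.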

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Decompiler development should target type inference and architecture-specific idioms before further scaling of repair models.
  • The three-dimension evaluation could be applied to larger codebases to test whether the reusability cliff widens or narrows with program size.
  • Teams recovering binaries for reuse may achieve better results by trying multiple decompilers rather than a single engine plus LLM repair.

Load-bearing premise

The 240 atomic test functions compiled into 640 binaries sufficiently represent the range of real-world decompilation challenges for measuring practical reusability.

What would settle it

Running the same five decompilers and three LLMs on a separate corpus of larger real-world open-source binaries and measuring whether the 22.3% to 1.2% gap and 20x decompiler variation persist or shift substantially.

Figures

Figures reproduced from arXiv: 2605.29490 by Jianlei Chi, Peng Di, Puzhuo Liu, Yuhan Huang, Yu Jiang.

Figure 1: The workflow of modern binary decompiler.
Figure 2: The decompiled result of the same function. Angr appears cleaner but returns an incorrect …
Figure 3: The evaluation pipeline of DEBENCH.
  • Ground truth in this paper is the original source code and the original binary's runtime behavior: source for readability and recompilability comparison, runtime traces for functionality. The LLM judge does not act as ground truth; it acts as a rater whose judgement is constrained by the source-code reference and cross-checked against the deterministic recompilability …
Figure 4: Cumulative compile-success rate vs. repair iteration over the full 9,066-trace corpus.
Figure 5: Readability across eight dimensions.
  … across architectures, while Angr, RetDec, and even BinaryAI are more volatile, the easier it is to be stumped by specific architectures.
  Plot labels recovered from the page: x-axis "Optimization Options" (O0, O1, O2, O3, Os); y-axis "Overall Readability (Mean of L1-L5)"; series IDA, Ghidra, BinaryAI, RetDec, Angr.
Figure 6: Overall readability across optimization options.
Figure 7: Dimension-specific debug gain of IDA. Each cell reports the improvement brought by …
Figure 8: Readability across optimizations w/wo-debug.
Figure 9: Outcome distribution by decompilers.
  Finding 2: Among 428k repair-error instances mined from the trace logs, type-related categories (Conflicting Types, Incompatible Pointer Type, Unknown Type, Incomplete Type, Member Access Error, Void Value Error) collectively account for ∼19%, comparable to all non-syntax categories combined. Type-system reconstruction, not surface syntax, is the dominant cost of recomp…
Figure 10: IDA-only FS/LF/CF distribution across dimensions.
Figure 11: Compile-success rate across target architectures.
Figure 12: IDA-only FS/LF/CF rates across optimizations.
Figure 13: Effective repair effort inside failure cases. Each bar is normalized to the initial error total.
Figure 14: Program-level outcome distribution.
  5.4.1 Single-Factor Effects. We analyze the impact of each variable on functionality. Compiler. Compiler creates the clearest global separation, in the direction opposite to readability. Clang substantially outperforms GCC at the observable program level, reaching 26.5% Exact+Partial versus only 10.2% for GCC. Clang also exposes more function evidence (69.2% vs. 52.8%) an…
Figure 15: Function I/O match across dimensions. Empty cells indicate that there is no comparable …

Original abstract

Binary decompilation aims to recover binaries into high-level source code, but existing evaluations mainly rely on syntactic similarity or single-axis readability metrics, which fail to capture practical reusability. We propose a reusability-driven evaluation paradigm that measures decompiler quality along three orthogonal dimensions: readability, recompilability, and functionality. We present DEBENCH, the first automated framework for multidimensional decompilation evaluation. DEBENCH contains 240 atomic test functions, organized into 8 source files and compiled into 640 binaries. It combines LLM-as-judge readability scoring with URAF (18 sub-dimensions), iterative compile-and-repair under a fixed 50-iteration budget, and Frida-based differential dynamic tracing at the program, function, and instruction levels. We evaluate five mainstream decompilers and three repair LLMs. Our study reveals four findings. First, the reusability cliff is steep: the best decompiler-LLM pair reaches 22.3% Exact+Partial program-level behavioral overlap but only 1.2% exact stdout match, nearly 50 points below recompilability. Second, settings that maximize readability do not maximize functionality: -O3 yields the lowest readability but the highest functionality, and Clang gives lower readability than GCC but 2.6x higher functionality. Third, cross-decompiler variation at the functional level is 20x, far larger than the 1.6x cross-LLM variation, showing that progress depends more on decompiler engines than larger repair models. Fourth, failures fall into three categories: syntactic noise, type-system collapse (about 19% of repair errors), and irreversible upstream losses such as ARM64 relocation idioms and C++ ABI features.
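
The distinction between an exact stdout match and a weaker behavioral overlap can be illustrated with a small differential comparison. The labels and the line-based overlap rule below are assumptions of this sketch; the source does not define how Exact and Partial are computed, and the paper's tracing also covers function and instruction levels, which this example ignores.

```python
from collections import Counter


def classify_stdout(original: str, recompiled: str) -> str:
    """Label a recompiled program's stdout against the original's.

    Assumed labels (not the paper's definitions):
      "exact"   : identical stdout text
      "partial" : at least one output line in common
      "none"    : no shared output lines
    """
    if original == recompiled:
        return "exact"
    ref = Counter(original.splitlines())
    got = Counter(recompiled.splitlines())
    shared = sum((ref & got).values())  # multiset intersection of lines
    return "partial" if shared > 0 else "none"


def outcome_rates(pairs):
    """Percentage of each label over (original, recompiled) stdout pairs."""
    labels = Counter(classify_stdout(o, r) for o, r in pairs)
    total = sum(labels.values())
    if total == 0:
        return {}
    return {k: 100.0 * labels.get(k, 0) / total
            for k in ("exact", "partial", "none")}
```

Under any rule of this shape, the exact rate can be far below the combined exact-plus-partial rate, which is the pattern the abstract reports (1.2% versus 22.3%).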

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DEBENCH, the first automated framework for multidimensional decompilation evaluation, using 240 atomic test functions compiled into 640 binaries. It measures decompiler quality (five decompilers, three repair LLMs) along readability (LLM-as-judge with URAF 18-subdimension scoring), recompilability (iterative compile-and-repair, 50-iteration budget), and functionality (Frida-based differential dynamic tracing at program/function/instruction levels). The study reports four findings: a steep reusability cliff (best pair at 22.3% Exact+Partial program-level behavioral overlap but 1.2% exact stdout match), settings maximizing readability do not maximize functionality (-O3 lowest readability but highest functionality; Clang 2.6x higher functionality than GCC), 20x cross-decompiler vs 1.6x cross-LLM functional variation, and three failure categories (syntactic noise, ~19% type-system collapse, irreversible upstream losses like ARM64 relocations and C++ ABI).

Significance. If the empirical results hold, the work establishes a reusability-driven evaluation paradigm that addresses limitations of prior syntactic or single-axis metrics, with concrete numbers from 640 binaries and an automated framework combining LLM judging, compile-repair, and dynamic tracing. Strengths include the orthogonal dimensions, explicit failure categorization, and the finding that decompiler engines matter more than repair-model scale. This provides a reproducible basis for prioritizing decompiler improvements over larger LLMs.

major comments (1)
  1. [DEBENCH description and findings] DEBENCH framework (abstract and evaluation setup): The central reusability-cliff claim (22.3% behavioral overlap vs. 1.2% stdout match) and the four findings rest on 240 atomic test functions. These lack inter-procedural data flow, complex heap layouts, and ABI interactions that dominate real binaries; the paper's own identification of irreversible upstream losses (ARM64 relocations, C++ ABI) indicates the test set may under-sample exactly those dominant failure modes, risking that the measured gap shrinks or reverses on larger programs.
minor comments (2)
  1. [Results] Results section: clarify the precise definition and computation of 'Exact+Partial program-level behavioral overlap' and how it differs from exact stdout match, including any thresholds or aggregation rules across the three tracing levels.
  2. [Methods] Methods: provide explicit data-exclusion rules, error bars or confidence intervals on the reported percentages, and the full list of the 8 source files or 240 functions to support reproducibility.
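
One standard way to attach an interval to a reported percentage is the Wilson score interval for a binomial proportion. The sketch below is an illustration of that request, not something the paper provides; the success counts and denominators behind the reported rates are not given in the source and would have to come from the authors' data.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

The Wilson form is preferred over the simple normal approximation for rates near 0% or 100%, which is relevant for a value as small as the 1.2% exact-match rate.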

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of DEBENCH's contributions and for the detailed comment on the evaluation setup. We respond point by point below.

Point-by-point responses
  1. Referee: The central reusability-cliff claim (22.3% behavioral overlap vs. 1.2% stdout match) and the four findings rest on 240 atomic test functions. These lack inter-procedural data flow, complex heap layouts, and ABI interactions that dominate real binaries; the paper's own identification of irreversible upstream losses (ARM64 relocations, C++ ABI) indicates the test set may under-sample exactly those dominant failure modes, risking that the measured gap shrinks or reverses on larger programs.

    Authors: The 240 atomic test functions were deliberately chosen to isolate decompilation effects across the three orthogonal dimensions (readability, recompilability, functionality) while eliminating confounding variables such as inter-procedural data flow or complex heap layouts. This controlled design enables the precise attribution of the 20x cross-decompiler versus 1.6x cross-LLM variation and the identification of the three failure categories, including the irreversible upstream losses (ARM64 relocations, C++ ABI) already highlighted in the manuscript. The reusability-cliff numbers and four findings are explicitly scoped to this reproducible test suite of 640 binaries; the paper does not claim they generalize to arbitrary real-world binaries. Extending DEBENCH to larger programs with richer ABI and data-flow interactions is a natural next step, but the current atomic baseline remains valuable for guiding targeted improvements to decompiler engines.

    Revision: no

Circularity Check

0 steps flagged

No circularity: pure empirical measurement study

The paper constructs DEBENCH from 240 atomic test functions compiled to 640 binaries and reports direct execution outcomes (readability scores via LLM-as-judge and URAF, recompilation success under 50-iteration repair, and Frida-based behavioral overlap percentages). No equations, fitted parameters, predictions, or derivations exist; the four findings are raw measurement results. No self-citations are load-bearing for any claim, and the evaluation pipeline is externally falsifiable on the stated test suite without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The evaluation rests on the representativeness of the 240 test functions and the validity of the three chosen dimensions for reusability; no free parameters or invented physical entities are present.

axioms (1)
  • Domain assumption: The 240 atomic test functions and their compilation into 640 binaries capture the key challenges of practical decompilation.
    The framework and all reported percentages depend on this representativeness claim stated in the abstract.
invented entities (2)
  • DEBENCH framework (no independent evidence)
    Purpose: Automated multidimensional evaluation of decompilers.
    New benchmark and tooling introduced by the paper with no external prior reference.
  • URAF readability scoring system (no independent evidence)
    Purpose: 18 sub-dimension readability metric used by the LLM judge.
    New scoring rubric defined for this study.

pith-pipeline@v0.9.1-grok · 5857 in / 1441 out tokens · 27930 ms · 2026-06-29T06:52:35.623827+00:00 · methodology
