pith. machine review for the scientific record.

arxiv: 2604.12913 · v1 · submitted 2026-04-14 · 💻 cs.SE · cs.AI · cs.CR

Recognition: unknown

CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:33 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CR
keywords binary decompilation · LLM refinement · reverse engineering · semantic recovery · adaptive inference · code re-executability

The pith

CoDe-R refines LLM decompiler output with rationale guidance and adaptive fallback to exceed 50% re-executability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CoDe-R, a lightweight two-stage framework that refines the output of large language models used for binary decompilation. The first stage trains the model with Semantic Cognitive Enhancement to recover high-level algorithmic intent through rationale-guided semantic injection. The second stage applies Dynamic Dual-Path Fallback at inference time, using a hybrid verification strategy to balance semantic recovery against syntactic stability. On the HumanEval-Decompile benchmark, this allows a 1.3B model to reach a new state-of-the-art in the lightweight regime by becoming the first such model to surpass 50% average re-executability, outperforming baselines and closing the gap to larger models. A sympathetic reader would care because reliable decompilation from stripped binaries supports reverse engineering without requiring enormous compute resources.
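
The two stages are easiest to see as data flow. As a hedged sketch, assuming (the summary does not spell this out) that rationale-guided semantic injection amounts to training on targets that state algorithmic intent before code, with every name below illustrative rather than from the paper:

```python
def make_sce_example(pseudo_code: str, rationale: str, source: str) -> dict:
    """Build one rationale-conditioned training pair (illustrative sketch,
    not the paper's exact recipe): the target emits a symbolic rationale
    comment describing algorithmic intent *before* the refined C source."""
    prompt = (
        "Refine the following decompiler output into compilable C. "
        "First state the function's purpose as a comment, then the code.\n\n"
        + pseudo_code
    )
    target = f"/* {rationale} */\n{source}"
    return {"prompt": prompt, "target": target}

# Toy usage mirroring the paper's float-pair example:
example = make_sce_example(
    pseudo_code="undefined8 func0(float param_1, long param_2, int param_3) { ... }",
    rationale="Returns 1 if any pair of floats differs by less than param_1",
    source="bool func0(float *arr, int n, float eps) { ... }",
)
```

Decoding rationale-first at inference is then what gives the second-stage verifier something semantic to check.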

Core claim

CoDe-R establishes that rationale-guided semantic injection during training combined with adaptive dual-path fallback during inference enables lightweight LLMs to reduce logical hallucinations and semantic misalignment in decompilation, producing the first 1.3B model to exceed 50% average re-executability rate on HumanEval-Decompile while outperforming prior lightweight approaches.

What carries the argument

The Semantic Cognitive Enhancement (SCE) rationale-guided injection strategy paired with the Dynamic Dual-Path Fallback (DDPF) adaptive inference mechanism that uses hybrid verification.
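
Neither component is specified in detail here. A minimal sketch of the fallback half, assuming the semantic (rationale-conditioned) path is kept only when a verifier accepts its candidate, with a syntactically conservative path as the safety net; every function name below is hypothetical:

```python
def dual_path_refine(pseudo_code, semantic_path, syntactic_path, verify):
    """Adaptive dual-path fallback (sketch): prefer the semantic path,
    but keep its output only if hybrid verification passes; otherwise
    fall back to the syntactically conservative path."""
    candidate = semantic_path(pseudo_code)
    if verify(candidate):  # e.g. compiles and re-executes the test suite
        return candidate, "semantic"
    return syntactic_path(pseudo_code), "fallback"

# Toy usage: the "semantic" path fails verification, so we fall back.
out, path = dual_path_refine(
    "raw pseudo-code",
    semantic_path=lambda p: "broken candidate",
    syntactic_path=lambda p: "conservative candidate",
    verify=lambda c: c == "correct candidate",
)
```

The design choice worth noting is that the verifier, not the generator, decides which path's output survives.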

If this is right

  • Lightweight models can now reach performance levels previously limited to much larger models on code recovery tasks.
  • The hybrid verification approach may transfer to improve reliability in other LLM-based code generation settings.
  • The gap between efficient models and expert-level decompilation performance narrows without needing to increase model size.
  • Open availability of the method supports wider use in practical reverse engineering pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Rationale guidance techniques could apply to other LLM tasks that suffer from irreversible information loss during translation or compilation, such as hardware description recovery.
  • The adaptive fallback pattern offers a general template for balancing multiple objectives like semantics and syntax in AI code systems.
  • Extending evaluation beyond synthetic benchmarks to proprietary or legacy binaries would test whether the re-executability gains persist in production settings.

Load-bearing premise

That the HumanEval-Decompile benchmark, together with the hybrid verification strategy in DDPF, accurately measures real-world semantic recovery, and that the observed gains are due to the proposed components rather than to other training or inference choices.

What would settle it

Testing the trained CoDe-R model on a fresh collection of stripped real-world executables from open-source projects outside the benchmark and measuring whether average re-executability stays above 50%.
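
Assuming "re-executability" means the refined function recompiles and passes the original program's test suite, such a replication study reduces to a pass-rate computation (the numbers below are made up for illustration; a real harness would compile the C and run tests in a sandbox):

```python
def avg_reexecutability(results: list[bool]) -> float:
    """Average Re-executability Rate: fraction of refined functions whose
    recompiled output passes the reference test suite (sketch)."""
    return sum(results) / len(results) if results else 0.0

# Made-up outcome for a hypothetical fresh set of 160 stripped functions:
passes = [True] * 83 + [False] * 77
rate = avg_reexecutability(passes)  # 83/160 = 0.51875, i.e. above 50%
```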

Figures

Figures reproduced from arXiv: 2604.12913 by Qiang Zhang, Zhongnian Li.

Figure 1. Comparison between existing methods and CoDe-R: While existing …
Figure 2. Motivation Analysis (Baseline: LLM4Decompile-Ref-1.3B on …
Figure 3. The overview of CoDe-R. The framework operates in two stages: Stage I employs SCE to train the model via rationale-conditional generation; Stage …
Figure 4. The running workflow of the DDPF mechanism.
Figure 5. Simplified Prompt Template for Generator. We query the model to …
Figure 7. Re-executability Rate vs. Code Length: CoDe-R (Red) demonstrates …
Original abstract

Binary decompilation is a critical reverse engineering task aimed at reconstructing high-level source code from stripped executables. Although Large Language Models (LLMs) have recently shown promise, they often suffer from "logical hallucinations" and "semantic misalignment" due to the irreversible semantic loss during compilation, resulting in generated code that fails to re-execute. In this study, we propose Cognitive Decompiler Refinement with Robustness (CoDe-R), a lightweight two-stage code refinement framework. The first stage introduces Semantic Cognitive Enhancement (SCE), a Rationale-Guided Semantic Injection strategy that trains the model to recover high-level algorithmic intent alongside code. The second stage introduces a Dynamic Dual-Path Fallback (DDPF) mechanism during inference, which adaptively balances semantic recovery and syntactic stability via a hybrid verification strategy. Evaluation on the HumanEval-Decompile benchmark demonstrates that CoDe-R (using a 1.3B backbone) establishes a new State-of-the-Art (SOTA) in the lightweight regime. Notably, it is the first 1.3B model to exceed an Average Re-executability Rate of 50.00%, significantly outperforming the baseline and effectively bridging the gap between efficient models and expert-level performance. Our code is available at https://github.com/Theaoi/CoDe-R.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes CoDe-R, a lightweight two-stage LLM-based framework for refining decompiler outputs from stripped binaries. Stage one applies Semantic Cognitive Enhancement (SCE) via rationale-guided semantic injection during training to recover high-level algorithmic intent. Stage two uses Dynamic Dual-Path Fallback (DDPF) at inference time with an adaptive hybrid verification strategy to balance semantic recovery against syntactic stability. On the HumanEval-Decompile benchmark, the 1.3B-parameter instantiation is reported to set a new SOTA in the lightweight regime and to be the first such model to exceed 50% Average Re-executability Rate.

Significance. If the empirical claims are substantiated, the work would be a useful engineering contribution to practical reverse-engineering tools by demonstrating that modest-sized models can be made substantially more effective at semantic recovery. The public release of code at https://github.com/Theaoi/CoDe-R is a clear strength that supports reproducibility and follow-on work.

major comments (2)
  1. [Abstract and Evaluation section] The headline SOTA claim (first 1.3B model >50% Average Re-executability Rate) is stated without any reported details on training data composition, exact metric definitions, statistical significance testing, or ablation controls that isolate the contributions of SCE and DDPF from other training or inference choices. This directly affects verifiability of the central empirical result.
  2. [Section 3.2] The DDPF mechanism's hybrid verification strategy is presented as balancing semantic recovery and syntactic stability, yet no quantitative analysis is supplied to demonstrate that observed gains reflect genuine functional equivalence rather than benchmark-specific artifacts or verification biases on HumanEval-Decompile. An ablation or sensitivity study addressing this attribution is required for the claim to be load-bearing.
minor comments (1)
  1. [Abstract] The abstract introduces several acronyms (SCE, DDPF) without a brief parenthetical expansion on first use; this is a minor clarity issue.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments point by point below. We agree that additional details and analyses will strengthen the verifiability of our claims and have planned revisions accordingly.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The headline SOTA claim (first 1.3B model >50% Average Re-executability Rate) is stated without any reported details on training data composition, exact metric definitions, statistical significance testing, or ablation controls that isolate the contributions of SCE and DDPF from other training or inference choices. This directly affects verifiability of the central empirical result.

    Authors: We acknowledge the referee's concern regarding the level of detail provided for the central empirical claims. While the Evaluation section describes the HumanEval-Decompile benchmark and reports the Average Re-executability Rate (defined as the fraction of generated decompiled functions that successfully re-execute against the original test suite), we agree that explicit details on training data composition, precise metric formulations, statistical testing, and isolating ablations are insufficiently highlighted. In the revised manuscript we will expand the Evaluation section with a dedicated paragraph on training data (synthetic pairs derived from HumanEval source with generated rationales), include bootstrap-based statistical significance results for the 50% threshold crossing, and add a consolidated ablation table that isolates SCE and DDPF from other training/inference choices. A concise reference to these elements will also be added to the abstract. revision: yes
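
A bootstrap check of the promised 50% crossing could look like the following sketch (synthetic pass/fail data, not the paper's results; the size 164 matches HumanEval's task count):

```python
import random

def bootstrap_lower_bound(passes, n_boot=2000, alpha=0.05, seed=0):
    """One-sided bootstrap lower confidence bound on the mean pass rate.
    The claim "rate > 50%" is supported at level alpha iff this bound > 0.5."""
    rng = random.Random(seed)
    n = len(passes)
    means = sorted(
        sum(rng.choice(passes) for _ in range(n)) / n for _ in range(n_boot)
    )
    return means[int(alpha * n_boot)]

# Synthetic data: 100 of 164 tasks pass (about 61%).
passes = [1] * 100 + [0] * 64
lb = bootstrap_lower_bound(passes)
```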

  2. Referee: [Section 3.2] The DDPF mechanism's hybrid verification strategy is presented as balancing semantic recovery and syntactic stability, yet no quantitative analysis is supplied to demonstrate that observed gains reflect genuine functional equivalence rather than benchmark-specific artifacts or verification biases on HumanEval-Decompile. An ablation or sensitivity study addressing this attribution is required for the claim to be load-bearing.

    Authors: We agree that the current description of the Dynamic Dual-Path Fallback (DDPF) mechanism in Section 3.2 would benefit from quantitative evidence that performance gains arise from improved functional equivalence rather than verification artifacts. The manuscript currently motivates the hybrid verification (execution-based semantic check combined with syntactic stability fallback) but does not report sensitivity or attribution studies. In the revision we will add an ablation subsection under Section 3.2 (or a new subsection in Evaluation) that quantifies: (i) the fraction of outputs passing execution-based verification versus syntactic-only checks, (ii) sensitivity of final re-executability to the adaptive threshold, and (iii) comparison against a non-adaptive baseline. We will also explicitly discuss potential HumanEval-Decompile-specific biases and how the dual-path design mitigates them. revision: yes
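
The attribution numbers proposed in this response could be tabulated with a small helper like this (all data synthetic; the check functions are stand-ins for real compile and execute verifiers):

```python
def ddpf_ablation(candidates, exec_check, syntax_check):
    """Report the fraction of candidates passing execution-based
    verification versus a syntactic-only check (sketch of the planned
    ablation; the gap is the failure mode hybrid verification exposes)."""
    n = len(candidates)
    exec_pass = sum(map(exec_check, candidates)) / n
    syn_pass = sum(map(syntax_check, candidates)) / n
    return {"exec_pass": exec_pass, "syntax_only_pass": syn_pass,
            "gap": syn_pass - exec_pass}

# Synthetic example: code that compiles but fails to re-execute is
# exactly what a syntactic-only check cannot catch.
report = ddpf_ablation(
    ["ok", "compiles_but_wrong", "broken", "ok"],
    exec_check=lambda c: c == "ok",
    syntax_check=lambda c: c != "broken",
)
```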

Circularity Check

0 steps flagged

No circularity: empirical engineering paper with no derivations or self-referential reductions

Full rationale

The manuscript presents CoDe-R as a two-stage LLM-based refinement framework (SCE for rationale-guided semantic injection during training; DDPF for adaptive hybrid verification at inference) evaluated on the HumanEval-Decompile benchmark. No equations, closed-form derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The SOTA claim rests on reported re-executability rates rather than any chain that reduces to its own inputs by construction. This matches the default case of a self-contained empirical contribution with no circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of two newly introduced stages whose internal hyperparameters, training objectives, and verification thresholds are not detailed in the abstract. No explicit free parameters, axioms, or invented entities are named.

pith-pipeline@v0.9.0 · 5535 in / 1171 out tokens · 25949 ms · 2026-05-10T14:33:49.631218+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 18 canonical work pages · 6 internal anchors

  1. [1]

    Cifuentes, Reverse compilation techniques

    C. Cifuentes, Reverse compilation techniques. Queensland University of Technology, Brisbane, 1994

  2. [2]

    Ida pro: a cross-platform multi-processor disassembler and debugger,

    Hex-Rays, “Ida pro: a cross-platform multi-processor disassembler and debugger,” https://hex-rays.com/ida-pro/, 2024

  3. [3]

    “Ghidra,” https://github.com/NationalSecurityAgency/ghidra, 2023

  4. [4]

    Analyzing memory accesses in x86 executables

    G. Balakrishnan and T. Reps, “Analyzing memory accesses in x86 executables,” in International Conference on Compiler Construction. Springer, 2004, pp. 5–23

  5. [5]

    Code Llama: Open Foundation Models for Code

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023

  6. [6]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li et al., “Deepseek-coder: When the large language model meets programming – the rise of code intelligence,” arXiv preprint arXiv:2401.14196, 2024

  7. [7]

    Llm4decompile: Decompiling binary code with large language models

    H. Tan, Q. Luo, J. Li, and Y. Zhang, “Llm4decompile: Decompiling binary code with large language models,” arXiv preprint arXiv:2403.05286, 2024

  8. [8]

    Salt4decompile: Inferring source-level abstract logic tree for llm-based binary decompilation

    Y. Wang, X. Xu, X. Zhu, X. Gu, and B. Shen, “Salt4decompile: Inferring source-level abstract logic tree for llm-based binary decompilation,” arXiv preprint arXiv:2509.14646, 2025

  9. [9]

    Sk2decompile: Llm-based two-phase binary decompilation from skeleton to skin

    H. Tan, W. Li, X. Tian, S. Wang, J. Liu, J. Li, and Y. Zhang, “Sk2decompile: Llm-based two-phase binary decompilation from skeleton to skin,” arXiv preprint arXiv:2509.22114, 2025

  10. [10]

    Ref decompile: Relabeling and function call enhanced decompile

    Y. Feng, B. Li, X. Shi, Q. Zhu, and W. Che, “Ref decompile: Relabeling and function call enhanced decompile,” arXiv preprint arXiv:2502.12221, 2025

  11. [11]

    Lost in the middle: How language models use long contexts

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024

  12. [12]

    Semantic-aware source code modeling

    Y. Ding, “Semantic-aware source code modeling,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 2494–2497

  13. [13]

    Chain-of-thought prompting elicits reasoning in large language models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 24824–24837

  14. [14]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024

  15. [15]

    Using recurrent neural networks for decompilation

    D. S. Katz, J. Ruchti, and E. Schulte, “Using recurrent neural networks for decompilation,” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2018, pp. 346–356

  16. [16]

    Neural reverse engineering of stripped binaries using augmented control flow graphs

    Y. David, U. Alon, and E. Yahav, “Neural reverse engineering of stripped binaries using augmented control flow graphs,” Proceedings of the ACM on Programming Languages, vol. 4, no. OOPSLA, pp. 1–28, 2020

  17. [17]

    The codeinverter suite: Control-flow and data-mapping augmented binary decompilation with llms

    P. Liu, J. Sun, R. Sun, L. Chen, Z. Yan, P. Zhang, D. Sun, D. Wang, X. Zhang, and D. Li, “The codeinverter suite: Control-flow and data-mapping augmented binary decompilation with llms,” arXiv preprint arXiv:2503.07215, 2025

  18. [18]

    D-lift: Improving llm-based decompiler backend via code quality-driven fine-tuning

    M. Zou, H. Cai, H. Wu, Z. L. Basque, A. Khan, B. Celik, A. Bianchi, D. Xu et al., “D-lift: Improving llm-based decompiler backend via code quality-driven fine-tuning,” arXiv preprint arXiv:2506.10125, 2025

  19. [19]

    Degpt: Optimizing decompiler output with llm

    P. Hu, R. Liang, and K. Chen, “Degpt: Optimizing decompiler output with llm,” in Network and Distributed System Security Symposium, 2024

  20. [20]

    Refining decompiled c code with large language models

    W. K. Wong, H. Wang, Z. Li, Z. Liu, S. Wang, Q. Tang, S. Nie, and S. Wu, “Refining decompiled c code with large language models,” arXiv preprint arXiv:2310.06530, 2023

  21. [21]

    Lmpa: Improving decompilation by synergy of large language model and program analysis

    X. Xu, Z. Zhang, S. Feng, Y. Ye, Z. Su, N. Jiang, S. Cheng, L. Tan, and X. Zhang, “Lmpa: Improving decompilation by synergy of large language model and program analysis,” arXiv preprint arXiv:2306.02546, 2023

  22. [22]

    Show Your Work: Scratchpads for Intermediate Computation with Language Models

    M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan et al., “Show your work: Scratchpads for intermediate computation with language models,” arXiv preprint arXiv:2112.00114, 2021

  23. [23]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

    C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 8003–8017

  24. [24]

    Investigating mysteries of cot-augmented distillation

    S. Wadhwa, S. Amir, and B. C. Wallace, “Investigating mysteries of cot-augmented distillation,” arXiv preprint arXiv:2406.14511, 2024

  25. [25]

    Codet: Code generation with generated tests

    B. Chen, F. Zhang, A. Nguyen, Z. Da, S. R. Bowman et al., “Codet: Code generation with generated tests,” in ICLR, 2023

  26. [26]

    Self-refine: Iterative refinement with self-feedback

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., “Self-refine: Iterative refinement with self-feedback,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 46534–46594

  27. [27]

    Compiler transformations for high-performance computing

    D. F. Bacon, S. L. Graham, and O. J. Sharp, “Compiler transformations for high-performance computing,” ACM Computing Surveys (CSUR), vol. 26, no. 4, pp. 345–420, 1994

  28. [28]

    No more gotos: Decompilation using pattern-independent control-flow structuring and semantic-preserving transformations

    K. Yakdan, S. Eschweiler, E. Gerhards-Padilla, and M. Smith, “No more gotos: Decompilation using pattern-independent control-flow structuring and semantic-preserving transformations,” in NDSS, 2015

  29. [29]

    Landmark attention: Random-access infinite context length for transformers

    A. Mohtashami and M. Jaggi, “Landmark attention: Random-access infinite context length for transformers,” arXiv preprint arXiv:2305.16300, 2023

  30. [30]

    Stanford alpaca: An instruction-following llama model

    R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” 2023

  31. [31]

    The information bottleneck method

    N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000

  32. [32]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021

  33. [33]

    Qwen3 technical report

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” 2025

  34. [34]

    Textbooks Are All You Need

    S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi et al., “Textbooks are all you need,” arXiv preprint arXiv:2306.11644, 2023

  35. [35]

    Bleu: a method for automatic evaluation of machine translation

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318

  36. [36]

    Nova: Generative language models for assembly code with hierarchical attention and contrastive learning

    N. Jiang, C. Wang, K. Liu, X. Xu, L. Tan, X. Zhang, and P. Babkin, “Nova: Generative language models for assembly code with hierarchical attention and contrastive learning,” arXiv preprint arXiv:2311.13721, 2023

Entries [37]–[43] in the extracted list are not citations: they are fragments of the paper's appendix (the Rationale Generator prompt template and a worked example) that were mis-parsed as references. Cleaned up, they read:

Prompt-template constraints (Appendix A, Rationale Generator M_gen):

    Output ONLY the comment block. Do not output the source code.
    The comment must start with /* and end with */.
    Content: Function: [Name] · Purpose: [Concise description].
    CRITICAL LOGIC CHECK (Must Follow): Loop Analysis: check how the inner loop initializes. If the inner loop index initializes using the outer loop's index (e.g., inner = outer or inner = outer + 1), explicitly describe it as comparing "all pairs" or "combinations". STRICTLY FORBID the word "adjacent" unl...

Worked example:

    Original Ghidra pseudo-code (input x):

        undefined8 func0(float param_1, long param_2, int param_3) {
            int local_10;
            int local_c;
            local_10 = 0;
            do {
                local_c = local_10;
                if (param_3 <= local_10) { return 0; }
                while (local_c = local_c + 1, local_c < param_3) {
                    if ((float)(DAT_001020d0 & (uint)(*(float *)(param_2 + (long)local_10 * 4) - *(float *)(param_2 + (long)l...

    Generated symbolic rationale (extracted z):

        /*
         * Function: func0
         * Purpose: Compares pairs of float values from an array pointed to by param_2,
         *          using a nested loop structure. Returns 1 if any pair difference
         *          (absolute value) is less than param_1; otherwise returns 0.
         */

    Baseline output (direct translation; compilation fails):

        bool func0(float param_1, long param_2, int param_3) {
            for (int i = 0; i < param_3; i++) {
                for (int j = i + 1; j < param_3; j++) {
                    // Logical & syntax error: fails to restore fabsf and pointer types
                    if ((float)(DAT_001020d0 & (uint)(*(float *)(param_2 + i * 4) - *(float *)(param_2 + j * 4))) < para...

    CoDe-R output (rationale-guided; successfully re-executes):

        bool func0(float *arr, int n, float eps) {
            int i, j;
            for (i = 0; i < n; i++) {
                for (j = i + 1; j < n; j++) {
                    // Success: perfectly restores fabsf and array indexing semantics
                    if (fabsf(arr[i] - arr[j]) < eps) {
                        return true;
                    }
                }
            }
            return false;
        }