CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference
Pith reviewed 2026-05-10 14:33 UTC · model grok-4.3
The pith
CoDe-R refines LLM decompiler output with rationale guidance and adaptive fallback to exceed 50% re-executability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoDe-R establishes that rationale-guided semantic injection during training combined with adaptive dual-path fallback during inference enables lightweight LLMs to reduce logical hallucinations and semantic misalignment in decompilation, producing the first 1.3B model to exceed 50% average re-executability rate on HumanEval-Decompile while outperforming prior lightweight approaches.
What carries the argument
The Semantic Cognitive Enhancement (SCE) rationale-guided injection strategy, paired with the Dynamic Dual-Path Fallback (DDPF) adaptive inference mechanism, which balances semantic recovery against syntactic stability via a hybrid verification strategy.
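The paper does not publish DDPF's control flow in this review, but the described behavior — prefer the rationale-guided semantic path, verify by execution, and fall back to a conservative syntactic refinement when verification fails — can be sketched roughly as below. All helper names (semantic_refine, syntactic_refine, compiles, passes_tests) are hypothetical stand-ins, not the paper's API.

```python
# Hedged sketch of a DDPF-style adaptive inference loop, under the assumption
# that hybrid verification ranks candidates by tier: re-executes > compiles > fails.

def hybrid_verify(code, compiles, passes_tests):
    """Return a verification tier: 2 = re-executes, 1 = compiles only, 0 = neither."""
    if not compiles(code):
        return 0
    return 2 if passes_tests(code) else 1

def ddpf_refine(pseudo_code, semantic_refine, syntactic_refine, compiles, passes_tests):
    """Prefer the rationale-guided semantic path; fall back to a conservative
    syntactic refinement only when the semantic output fails verification."""
    semantic_out = semantic_refine(pseudo_code)
    semantic_tier = hybrid_verify(semantic_out, compiles, passes_tests)
    if semantic_tier == 2:  # semantic recovery verified by execution
        return semantic_out, "semantic"
    syntactic_out = syntactic_refine(pseudo_code)
    syntactic_tier = hybrid_verify(syntactic_out, compiles, passes_tests)
    # Keep whichever candidate reaches the higher verification tier.
    if syntactic_tier > semantic_tier:
        return syntactic_out, "fallback"
    return semantic_out, "semantic"
```

The key design point the abstract implies: the fallback only fires when it demonstrably improves stability, so semantic recovery is never sacrificed for a candidate that verifies no better.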
If this is right
- Lightweight models can now reach performance levels previously limited to much larger models on code recovery tasks.
- The hybrid verification approach may transfer to improve reliability in other LLM-based code generation settings.
- The gap between efficient models and expert-level decompilation performance narrows without needing to increase model size.
- Open availability of the method supports wider use in practical reverse engineering pipelines.
Where Pith is reading between the lines
- Rationale guidance techniques could apply to other LLM tasks that suffer from irreversible information loss during translation or compilation, such as hardware description recovery.
- The adaptive fallback pattern offers a general template for balancing multiple objectives like semantics and syntax in AI code systems.
- Extending evaluation beyond synthetic benchmarks to proprietary or legacy binaries would test whether the re-executability gains persist in production settings.
Load-bearing premise
That the HumanEval-Decompile benchmark, together with the hybrid verification strategy in DDPF, accurately measures real-world semantic recovery, and that the observed gains stem from the proposed components rather than from other training or inference choices.
What would settle it
Testing the trained CoDe-R model on a fresh collection of stripped real-world executables from open-source projects outside the benchmark and measuring whether average re-executability stays above 50%.
Original abstract
Binary decompilation is a critical reverse engineering task aimed at reconstructing high-level source code from stripped executables. Although Large Language Models (LLMs) have recently shown promise, they often suffer from "logical hallucinations" and "semantic misalignment" due to the irreversible semantic loss during compilation, resulting in generated code that fails to re-execute. In this study, we propose Cognitive Decompiler Refinement with Robustness (CoDe-R), a lightweight two-stage code refinement framework. The first stage introduces Semantic Cognitive Enhancement (SCE), a Rationale-Guided Semantic Injection strategy that trains the model to recover high-level algorithmic intent alongside code. The second stage introduces a Dynamic Dual-Path Fallback (DDPF) mechanism during inference, which adaptively balances semantic recovery and syntactic stability via a hybrid verification strategy. Evaluation on the HumanEval-Decompile benchmark demonstrates that CoDe-R (using a 1.3B backbone) establishes a new State-of-the-Art (SOTA) in the lightweight regime. Notably, it is the first 1.3B model to exceed an Average Re-executability Rate of 50.00%, significantly outperforming the baseline and effectively bridging the gap between efficient models and expert-level performance. Our code is available at https://github.com/Theaoi/CoDe-R.
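The SCE stage, as described, trains the model to articulate algorithmic intent alongside recovered code. A minimal sketch of how such rationale-injected training pairs might be assembled is shown below; the field names and prompt formatting are assumptions for illustration, not the paper's actual data schema.

```python
def build_sce_example(pseudo_code, rationale, source_code):
    """Assemble a hypothetical rationale-injected training pair: the target
    places a symbolic rationale (high-level intent) BEFORE the recovered
    source, so the model learns to recover intent alongside code."""
    prompt = (
        "# Decompiler pseudo-code:\n"
        f"{pseudo_code}\n"
        "# Recover the rationale and source code:\n"
    )
    completion = f"/* Rationale: {rationale} */\n{source_code}"
    return {"prompt": prompt, "completion": completion}

# Toy usage with placeholder strings (not from the paper's dataset):
example = build_sce_example(
    pseudo_code="undefined8 func0(...) { ... }",
    rationale="compares all pairs of floats; returns 1 if any |diff| < eps",
    source_code="bool func0(float *arr, int n, float eps) { ... }",
)
```

Ordering the rationale before the code is the essential choice here: at generation time the model must commit to the algorithmic intent first, which is what the abstract credits with reducing logical hallucinations.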
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes CoDe-R, a lightweight two-stage LLM-based framework for refining decompiler outputs from stripped binaries. Stage one applies Semantic Cognitive Enhancement (SCE) via rationale-guided semantic injection during training to recover high-level algorithmic intent. Stage two uses Dynamic Dual-Path Fallback (DDPF) at inference time with an adaptive hybrid verification strategy to balance semantic recovery against syntactic stability. On the HumanEval-Decompile benchmark, the 1.3B-parameter instantiation is reported to set a new SOTA in the lightweight regime and to be the first such model to exceed 50% Average Re-executability Rate.
Significance. If the empirical claims are substantiated, the work would be a useful engineering contribution to practical reverse-engineering tools by demonstrating that modest-sized models can be made substantially more effective at semantic recovery. The public release of code at https://github.com/Theaoi/CoDe-R is a clear strength that supports reproducibility and follow-on work.
Major comments (2)
- [Abstract and Evaluation section] The headline SOTA claim (first 1.3B model to exceed a 50% Average Re-executability Rate) is stated without reported details on training data composition, exact metric definitions, statistical significance testing, or ablation controls that isolate the contributions of SCE and DDPF from other training or inference choices. This directly affects the verifiability of the central empirical result.
- [Section 3.2] The DDPF mechanism's hybrid verification strategy is presented as balancing semantic recovery and syntactic stability, yet no quantitative analysis demonstrates that the observed gains reflect genuine functional equivalence rather than benchmark-specific artifacts or verification biases on HumanEval-Decompile. An ablation or sensitivity study addressing this attribution is required for the claim to be load-bearing.
Minor comments (1)
- [Abstract] The abstract introduces several acronyms (SCE, DDPF) without a brief parenthetical expansion on first use; this is a minor clarity issue.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive comments on our manuscript. We address each of the major comments point by point below. We agree that additional details and analyses will strengthen the verifiability of our claims and have planned revisions accordingly.
Point-by-point responses
Referee: [Abstract and Evaluation section] The headline SOTA claim (first 1.3B model to exceed a 50% Average Re-executability Rate) is stated without reported details on training data composition, exact metric definitions, statistical significance testing, or ablation controls that isolate the contributions of SCE and DDPF from other training or inference choices. This directly affects the verifiability of the central empirical result.
Authors: We acknowledge the referee's concern regarding the level of detail provided for the central empirical claims. While the Evaluation section describes the HumanEval-Decompile benchmark and reports the Average Re-executability Rate (defined as the fraction of generated decompiled functions that successfully re-execute against the original test suite), we agree that explicit details on training data composition, precise metric formulations, statistical testing, and isolating ablations are missing or insufficiently highlighted. In the revised manuscript we will expand the Evaluation section with a dedicated paragraph on training data (synthetic pairs derived from HumanEval source with generated rationales), include bootstrap-based statistical significance results for the 50% threshold crossing, and add a consolidated ablation table that isolates SCE and DDPF from other training/inference choices. A concise reference to these elements will also be added to the abstract. Revision planned: yes.
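The metric and the promised bootstrap check can be sketched as follows. This is a toy illustration of the statistics only, not the authors' evaluation harness; the function names and the percentile-bootstrap formulation are assumptions.

```python
import random

def avg_reexec_rate(results):
    """Average Re-executability Rate: fraction of decompiled functions that
    re-execute correctly. `results` is a list of booleans, one per function."""
    return sum(results) / len(results)

def arr_exceeds_threshold(results, threshold=0.50, n_boot=5000, alpha=0.05, seed=0):
    """Hypothetical bootstrap test for the 50% threshold crossing: resample
    the per-function outcomes with replacement and check whether the lower
    (alpha/2) percentile of resampled ARRs stays above `threshold`."""
    rng = random.Random(seed)
    n = len(results)
    rates = sorted(
        sum(rng.choice(results) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lower = rates[int((alpha / 2) * n_boot)]
    return lower > threshold
```

On a toy run of 100 functions with 70 successes, the ARR is 0.70 and the lower confidence bound clears 0.50; with exactly 50 successes it does not, which is precisely why a point estimate of "just over 50%" needs the interval reported alongside it.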
Referee: [Section 3.2] The DDPF mechanism's hybrid verification strategy is presented as balancing semantic recovery and syntactic stability, yet no quantitative analysis demonstrates that the observed gains reflect genuine functional equivalence rather than benchmark-specific artifacts or verification biases on HumanEval-Decompile. An ablation or sensitivity study addressing this attribution is required for the claim to be load-bearing.
Authors: We agree that the current description of the Dynamic Dual-Path Fallback (DDPF) mechanism in Section 3.2 would benefit from quantitative evidence that performance gains arise from improved functional equivalence rather than verification artifacts. The manuscript currently motivates the hybrid verification (an execution-based semantic check combined with a syntactic stability fallback) but does not report sensitivity or attribution studies. In the revision we will add an ablation subsection under Section 3.2 (or a new subsection in Evaluation) that quantifies: (i) the fraction of outputs passing execution-based verification versus syntactic-only checks, (ii) the sensitivity of final re-executability to the adaptive threshold, and (iii) a comparison against a non-adaptive baseline. We will also explicitly discuss potential HumanEval-Decompile-specific biases and how the dual-path design mitigates them. Revision planned: yes.
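Ablation item (i) — execution-based versus syntactic-only verification — could be tallied along these lines. This is a schematic sketch only; the two check functions are placeholders, not the paper's verifier.

```python
def verification_breakdown(candidates, compiles, passes_tests):
    """Classify each candidate output into the tier it reaches under a
    hybrid verification scheme: re-executes, compiles only, or fails outright.
    `compiles` and `passes_tests` are caller-supplied predicates (placeholders
    for a real compile step and test-suite run)."""
    stats = {"re_executes": 0, "compiles_only": 0, "fails": 0}
    for code in candidates:
        if not compiles(code):
            stats["fails"] += 1
        elif passes_tests(code):
            stats["re_executes"] += 1
        else:
            stats["compiles_only"] += 1
    return stats
```

Reporting this three-way split directly would let readers see how much of the headline re-executability gain comes from genuine functional equivalence versus candidates that merely survive the syntactic check.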
Circularity Check
No circularity: empirical engineering paper with no derivations or self-referential reductions
Full rationale
The manuscript presents CoDe-R as a two-stage LLM-based refinement framework (SCE for rationale-guided semantic injection during training; DDPF for adaptive hybrid verification at inference) evaluated on the HumanEval-Decompile benchmark. No equations, closed-form derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described method. The SOTA claim rests on reported re-executability rates rather than any chain that reduces to its own inputs by construction. This matches the default case of a self-contained empirical contribution with no circularity patterns.
Reference graph
Works this paper leans on
- [1] C. Cifuentes, "Reverse compilation techniques," Queensland University of Technology, Brisbane, 1994.
- [2] Hex-Rays, "IDA Pro: a cross-platform multi-processor disassembler and debugger," https://hex-rays.com/ida-pro/, 2024.
- [3] "Ghidra," https://github.com/NationalSecurityAgency/ghidra, 2023.
- [4] G. Balakrishnan and T. Reps, "Analyzing memory accesses in x86 executables," in International Conference on Compiler Construction. Springer, 2004, pp. 5–23.
- [5] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez et al., "Code Llama: open foundation models for code," arXiv preprint arXiv:2308.12950, 2023.
- [6] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li et al., "DeepSeek-Coder: when the large language model meets programming — the rise of code intelligence," arXiv preprint arXiv:2401.14196, 2024.
- [7] H. Tan, Q. Luo, J. Li, and Y. Zhang, "LLM4Decompile: decompiling binary code with large language models," arXiv preprint arXiv:2403.05286, 2024.
- [8] Y. Wang, X. Xu, X. Zhu, X. Gu, and B. Shen, "SALT4Decompile: inferring source-level abstract logic tree for LLM-based binary decompilation," arXiv preprint arXiv:2509.14646, 2025.
- [9] H. Tan, W. Li, X. Tian, S. Wang, J. Liu, J. Li, and Y. Zhang, "SK2Decompile: LLM-based two-phase binary decompilation from skeleton to skin," arXiv preprint arXiv:2509.22114, 2025.
- [10] Y. Feng, B. Li, X. Shi, Q. Zhu, and W. Che, "ReF Decompile: relabeling and function call enhanced decompile," arXiv preprint arXiv:2502.12221, 2025.
- [11] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, "Lost in the middle: how language models use long contexts," Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024.
- [12] Y. Ding, "Semantic-aware source code modeling," in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, 2024, pp. 2494–2497.
- [13] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., "Chain-of-thought prompting elicits reasoning in large language models," in Advances in Neural Information Processing Systems, vol. 35, 2022, pp. 24824–24837.
- [14] C. Snell, J. Lee, K. Xu, and A. Kumar, "Scaling LLM test-time compute optimally can be more effective than scaling model parameters," arXiv preprint arXiv:2408.03314, 2024.
- [15] D. S. Katz, J. Ruchti, and E. Schulte, "Using recurrent neural networks for decompilation," in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2018, pp. 346–356.
- [16] Y. David, U. Alon, and E. Yahav, "Neural reverse engineering of stripped binaries using augmented control flow graphs," Proceedings of the ACM on Programming Languages, vol. 4, no. OOPSLA, pp. 1–28, 2020.
- [17] P. Liu, J. Sun, R. Sun, L. Chen, Z. Yan, P. Zhang, D. Sun, D. Wang, X. Zhang, and D. Li, "The CodeInverter suite: control-flow and data-mapping augmented binary decompilation with LLMs," arXiv preprint arXiv:2503.07215, 2025.
- [18] M. Zou, H. Cai, H. Wu, Z. L. Basque, A. Khan, B. Celik, A. Bianchi, D. Xu et al., "D-LIFT: improving LLM-based decompiler backend via code quality-driven fine-tuning," arXiv preprint arXiv:2506.10125, 2025.
- [19] P. Hu, R. Liang, and K. Chen, "DeGPT: optimizing decompiler output with LLM," in Network and Distributed System Security Symposium, 2024.
- [20] W. K. Wong, H. Wang, Z. Li, Z. Liu, S. Wang, Q. Tang, S. Nie, and S. Wu, "Refining decompiled C code with large language models," arXiv preprint arXiv:2310.06530, 2023.
- [21] X. Xu, Z. Zhang, S. Feng, Y. Ye, Z. Su, N. Jiang, S. Cheng, L. Tan, and X. Zhang, "LmPa: improving decompilation by synergy of large language model and program analysis," arXiv preprint arXiv:2306.02546, 2023.
- [22] M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan et al., "Show your work: scratchpads for intermediate computation with language models," arXiv preprint arXiv:2112.00114, 2021.
- [23] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, "Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes," in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 8003–8017.
- [24] S. Wadhwa, S. Amir, and B. C. Wallace, "Investigating mysteries of CoT-augmented distillation," arXiv preprint arXiv:2406.14511, 2024.
- [25] B. Chen, F. Zhang, A. Nguyen, Z. Da, S. R. Bowman et al., "CodeT: code generation with generated tests," in ICLR, 2023.
- [26] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., "Self-Refine: iterative refinement with self-feedback," in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 46534–46594.
- [27] D. F. Bacon, S. L. Graham, and O. J. Sharp, "Compiler transformations for high-performance computing," ACM Computing Surveys (CSUR), vol. 26, no. 4, pp. 345–420, 1994.
- [28] K. Yakdan, S. Eschweiler, E. Gerhards-Padilla, and M. Smith, "No more gotos: decompilation using pattern-independent control-flow structuring and semantic-preserving transformations," in NDSS, 2015.
- [29] A. Mohtashami and M. Jaggi, "Landmark attention: random-access infinite context length for transformers," arXiv preprint arXiv:2305.16300, 2023.
- [30] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Stanford Alpaca: an instruction-following LLaMA model," 2023.
- [31] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.
- [32] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
- [33] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," 2025.
- [34] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi et al., "Textbooks are all you need," arXiv preprint arXiv:2306.11644, 2023.
- [35] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- [36] N. Jiang, C. Wang, K. Liu, X. Xu, L. Tan, X. Zhang, and P. Babkin, "Nova: generative language models for assembly code with hierarchical attention and contrastive learning," arXiv preprint arXiv:2311.13721, 2023.

Appendix A: Detailed Prompt for Rationale Generator

The complete prompt template utilized for the Rationale Generator (M_gen) is detailed below. This ...
Among the constraints the prompt enforces:

- Output ONLY the comment block. Do not output the source code.
- The comment must start with /* and end with */.
- Content: Function: [Name]; Purpose: [Concise description]. CRITICAL LOGIC CHECK (Must Follow): Loop Analysis: check how the inner loop initializes. If the inner loop index initializes using the outer loop's index (e.g., inner = outer or inner = outer + 1), explicitly describe it as comparing "all pairs" or "combinations". STRICTLY FORBID the word "adjacent" unl...

Worked example:

Original Ghidra pseudo-code (input x):

    undefined8 func0(float param_1, long param_2, int param_3) {
      int local_10;
      int local_c;
      local_10 = 0;
      do {
        local_c = local_10;
        if (param_3 <= local_10) {
          return 0;
        }
        while (local_c = local_c + 1, local_c < param_3) {
          if ((float)(DAT_001020d0 & (uint)(*(float *)(param_2 + (long)local_10 * 4) - *(float *)(param_2 + (long)l...

Generated symbolic rationale (extracted z):

    /*
     * Function: func0
     * Purpose: Compares pairs of float values from an array pointed to by param_2,
     *          using a nested loop structure. Returns 1 if any pair difference
     *          (absolute value) is less than param_1; otherwise returns 0.
     */

Baseline output (direct translation; compilation fails):

    bool func0(float param_1, long param_2, int param_3) {
      for (int i = 0; i < param_3; i++) {
        for (int j = i + 1; j < param_3; j++) {
          // Logical & syntax error: fails to restore fabsf and pointer types
          if ((float)(DAT_001020d0 & (uint)(*(float *)(param_2 + i * 4) - *(float *)(param_2 + j * 4))) < para...

CoDe-R output (rationale-guided; successfully re-executes):

    bool func0(float *arr, int n, float eps) {
      int i, j;
      for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
          // Success: perfectly restores fabsf and array indexing semantics
          if (fabsf(arr[i] - arr[j]) < eps) {
            return true;
          }
        }
      }
      return false;
    }