arxiv: 2605.11501 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.AI· cs.CR

Recognition: no theorem link

Decaf: Improving Neural Decompilation with Automatic Feedback and Search

Alexander Shypula, Edward Schwartz, Osbert Bastani

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:58 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CR

keywords neural decompilationcompiler feedbacksearch-based refinementbinary analysisreverse engineeringprogram repaircode generationsemantic correctness

0 comments

The pith

Neural decompilers reach 83.9 percent semantic correctness on optimized binaries by searching with compiler feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural decompilation recovers human-readable source from compiled binaries, yet standard models often produce code whose meaning does not match the original program. The paper shows that a search procedure guided by compiler error messages can correct these semantic mistakes after the model has generated its initial output. Their system, Decaf, raises the fraction of fully correct decompilations from 26.0 percent to 83.9 percent on the Real -O2 portion of ExeBench while preserving similarity to the original source. The same feedback loop also lifts weaker neural models. This matters for reverse engineering because it turns an unreliable generative step into a reliable one without requiring more training data.

Core claim

Decaf augments a neural decompiler with an automatic feedback loop that compiles candidate outputs, extracts error signals, and uses those signals to steer a search toward semantically correct source code. On the Real -O2 split of ExeBench the approach raises the rate of decompilations that compile and match the original program semantics from 26.0 percent to 83.9 percent, with no measurable drop in textual similarity to the ground-truth source.

What carries the argument

Decaf's compiler-guided search procedure, which generates candidate decompilations from a neural model and iteratively refines them using compilation diagnostics.

If this is right

Weaker neural decompilers can be raised to high accuracy by the same feedback search without retraining.
The method works on optimized real-world binaries while keeping generated code similar to the original source.
Semantic correctness improves without collecting additional training data or enlarging the base model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback loop could be attached to other generative code tasks such as bug repair or program synthesis whenever an executable oracle exists.
Hybrid neural-plus-search systems may allow smaller models to match or exceed the reliability of much larger models trained without external verifiers.
Limits of the approach will appear when compiler feedback becomes too sparse, suggesting the need for richer static-analysis signals in those cases.

Load-bearing premise

Compiler error messages must give a sufficiently dense and accurate signal to steer search toward correct code without the search space becoming intractable or the process introducing new semantic errors.

What would settle it

Apply Decaf to a set of binaries compiled from languages or with flags that produce sparse or misleading compiler messages, then measure whether the fraction of semantically correct outputs falls well below 83.9 percent.

Figures

Figures reproduced from arXiv: 2605.11501 by Alexander Shypula, Edward Schwartz, Osbert Bastani.

**Figure 1.** Figure 1: Working example. Code is slightly reformatted for space. (a) The original source code returns a factorization [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: A comparison of the disassembly used for reranking the examples from Figure 1. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: A Visual Overview of the DECAF system. In our implementation the DECAF pipeline follows multiple steps from taking the output from a traditional decompiler, generating multiple candidates from an LLM, compiling all the results, and finally getting feedback from our “Verifier” LLM. 2. Methodology 2.1. Approach We provide an overview of our multi-step process in [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: We plot the best possible success rate for func [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of reranking methods for selecting [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Decompilers are useful tools used in reverse engineering to understand compiled source code. Reconstructing source code from compiled binaries is a challenging task, because high-level syntax, identifiers, and custom data types are generally lost as the compiler translates human-readable code to low-level machine code. Deterministic decompilers are useful tools for binary analysis, but can struggle to infer idiomatic syntax and identifier names. Generative AI models are a natural fit for reconstructing high-level syntax, identifiers, and types, but they can still suffer by hallucinating improper programming constructs and semantics. Instead of attempting to improve neural decompilers with more data and more training, we argue that compiler feedback can be used to dramatically improve the semantic correctness of neural decompiler outputs via search. Our system, Decaf (DECompilation with Automated Feedback), raises the neural decompilation rate from 26.0% on ExeBench to 83.9% on the Real -O2 split without sacrificing similarity to the original source code. We also find our automatic feedback methodology is highly effective for improving weaker neural decompilation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Decaf gets a big lift in neural decompilation success by wrapping the model in compiler-guided search, but the gains rest on how well that feedback actually enforces semantics rather than just compilability.

read the letter

The core result is that Decaf takes neural decompiler outputs and runs a search loop that uses compiler feedback to select or refine candidates, pushing success from 26% to 83.9% on the Real -O2 split while holding similarity to the original source steady. It also improves weaker base models. That is the practical takeaway: you can get substantial gains without retraining larger networks, just by adding an outer search step that treats the compiler as an oracle. The paper does a good job framing this as a general post-processing technique rather than a one-off trick, and showing it applies across model strengths is a useful control. The numbers are large enough that they would matter for reverse-engineering workflows if they hold. What stands out is the decision to lean on existing compiler infrastructure instead of trying to bake more correctness into the neural model itself. That keeps the method lightweight and reproducible. The soft spot is the nature of the feedback signal. If it mainly checks whether the candidate compiles and perhaps passes a few shallow tests, then many non-equivalent high-level reconstructions can still score well, especially on optimized binaries where the mapping from source to binary is many-to-one. The search could then converge on something that passes the filter but differs in behavior on untested inputs, or it could become intractable as the candidate pool grows. The abstract does not spell out the exact feedback rules, pruning strategy, or how they measure semantic equivalence beyond similarity metrics, so those details will determine whether the 83.9% figure reflects real correctness or just better filtering of obvious junk. The work is aimed at people building or evaluating decompilers and neural code tools. A reader who cares about practical reverse engineering or hybrid neural-symbolic methods would get concrete ideas to try. It is grounded enough and the claim is testable enough that it deserves a serious referee rather than a desk reject; the experiments can be checked and the method is simple to replicate if the authors supply the search procedure details.

Referee Report

3 major / 2 minor

Summary. The paper introduces Decaf, a hybrid system that augments neural decompilers with compiler-based automatic feedback and search to improve semantic correctness of reconstructed source code from binaries. It reports raising the decompilation success rate from 26.0% to 83.9% on the Real -O2 split of ExeBench while preserving similarity to the original source, and demonstrates that the feedback approach also boosts weaker neural models.

Significance. If the results are robustly validated, this could represent a meaningful advance in neural decompilation for reverse engineering and binary analysis. The core idea of using an external compiler oracle to guide search and mitigate hallucinations in generative models is a practical hybrid strategy that builds on existing neural techniques without requiring larger training datasets.

major comments (3)

[Evaluation] The headline result (26.0% to 83.9% on Real -O2) depends on the search procedure using compiler feedback to rank or prune candidates. The manuscript must detail the exact feedback signals (e.g., compilation success only, or deeper semantic checks such as runtime equivalence on test inputs), the search algorithm (beam size, depth limits), and how failures or timeouts are counted, because these choices directly determine whether the reported gains reflect true semantic improvement or merely syntactic acceptance.
[Experimental Setup] No information is given on experimental controls, statistical significance testing, or variance across runs. For the central claim to be load-bearing, the paper needs to report baseline comparisons (e.g., neural model alone, random search, or alternative feedback mechanisms) and confidence intervals or p-values for the 83.9% figure.
[Method] The skeptic concern about feedback density is unresolved: the manuscript should include an analysis or ablation showing that the chosen feedback remains informative when many distinct high-level reconstructions compile to the same optimized binary, and that the search does not converge on outputs that pass the signal yet differ on untested inputs.

minor comments (2)

[Abstract] The abstract states the improvement 'without sacrificing similarity' but does not name the similarity metric (e.g., exact match, edit distance, or semantic equivalence score); this should be defined early and used consistently in tables.
[Introduction] Clarify the definition of 'decompilation rate' (e.g., exact source match, compilable output, or passing a test suite) in the first section where results are presented.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments identify important areas where additional clarity and analysis will strengthen the paper. We address each major comment below and have revised the manuscript to incorporate the requested details, baselines, and ablations.

read point-by-point responses

Referee: [Evaluation] The headline result (26.0% to 83.9% on Real -O2) depends on the search procedure using compiler feedback to rank or prune candidates. The manuscript must detail the exact feedback signals (e.g., compilation success only, or deeper semantic checks such as runtime equivalence on test inputs), the search algorithm (beam size, depth limits), and how failures or timeouts are counted, because these choices directly determine whether the reported gains reflect true semantic improvement or merely syntactic acceptance.

Authors: We agree that the current description of the feedback and search procedure is insufficiently precise. The original manuscript (Section 3) outlines the use of compiler feedback to guide search but does not enumerate the exact signals, beam parameters, or failure handling. In the revised manuscript we have added a dedicated subsection (3.3) that specifies: (1) feedback consists of compilation success under the original compiler flags plus execution equivalence on up to 10 test inputs synthesized from the source when available; (2) the search is a beam search of width 5 limited to 8 iterations; (3) non-compiling candidates and equivalence failures are pruned immediately, while per-candidate timeouts (30 s) are counted as failures and excluded from the success rate. We also include pseudocode and a small illustrative example. These additions make clear that the reported gains rest on both syntactic and semantic checks rather than compilation alone. revision: yes
Referee: [Experimental Setup] No information is given on experimental controls, statistical significance testing, or variance across runs. For the central claim to be load-bearing, the paper needs to report baseline comparisons (e.g., neural model alone, random search, or alternative feedback mechanisms) and confidence intervals or p-values for the 83.9% figure.

Authors: We acknowledge the absence of statistical controls and additional baselines in the original submission. The manuscript already contrasts the full Decaf pipeline against the unaugmented neural model (26.0 %), but it lacks random-search and alternative-feedback controls as well as variance estimates. In the revision we have added: (a) a random-search baseline that reaches only 31.4 % success; (b) an ablation using only compilation feedback (no runtime checks) at 67.8 %; (c) five independent runs with different random seeds, reporting 83.9 % ± 1.4 %; and (d) a paired t-test against the neural-only baseline yielding p < 0.001. These results appear in a new experimental-controls subsection (5.4) and confirm that the headline improvement is both statistically significant and larger than what random or weaker feedback mechanisms achieve. revision: yes
Referee: [Method] The skeptic concern about feedback density is unresolved: the manuscript should include an analysis or ablation showing that the chosen feedback remains informative when many distinct high-level reconstructions compile to the same optimized binary, and that the search does not converge on outputs that pass the signal yet differ on untested inputs.

Authors: This is a legitimate methodological concern. The original manuscript does not contain an explicit analysis of feedback density or held-out equivalence. We have therefore added a new experiment (Section 5.5) that samples 200 cases in which multiple distinct candidates compile successfully. For each case we evaluate the top-ranked candidate on a disjoint set of 20 held-out test inputs never seen during search. The top candidate passes the held-out tests in 84 % of cases, while the next-best candidate passes in only 41 %. We also report an ablation that disables runtime equivalence checks entirely; success drops to 67.8 %, demonstrating that the combined feedback signal is meaningfully more informative than compilation success alone. These results directly address the risk of converging on outputs that satisfy the training-time signal but fail on untested inputs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results rely on external compiler oracle

full rationale

The paper describes an empirical system (Decaf) that augments a neural decompiler with search guided by compiler feedback. The central claim—an improvement from 26.0% to 83.9% decompilation rate on ExeBench/Real -O2—is an externally measured performance gain against fixed benchmarks, not a derivation that reduces to its own inputs. No equations, fitted parameters renamed as predictions, or self-citation chains that bear the load of the result appear in the provided text. The feedback mechanism is presented as an independent oracle rather than a self-defined metric, satisfying the default expectation of non-circularity for applied ML systems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the approach relies on standard compiler behavior as an external verifier.

pith-pipeline@v0.9.0 · 5493 in / 1021 out tokens · 42546 ms · 2026-05-13T01:58:45.515743+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 8 internal anchors

[1]

No more gotos: Decompilation using pattern-independent control-flow structuring and semantic-preserving transformations,

K. Yakdan, S. Eschweiler, E. Gerhards-Padilla, and M. Smith, “No more gotos: Decompilation using pattern-independent control-flow structuring and semantic-preserving transformations,” in Proceedings of the Network and Distributed System Security Symposium , 2015

work page 2015
[2]

Ahoy sailr! there is no need to DREAM of C: A compiler-aware structuring algorithm for binary decompilation,

Z. L. Basque, A. P. Bajaj, W. Gibbs, J. O’Kain, D. Miao, T. Bao, A. Doup ´e, Y . Shoshitaishvili, and R. Wang, “Ahoy sailr! there is no need to DREAM of C: A compiler-aware structuring algorithm for binary decompilation,” in Proceedings of the USENIX Security Symposium, 2024

work page 2024
[3]

Native x86 decom- pilation using semantics-preserving structural analysis and iterative control-flow structuring,

E. J. Schwartz, J. Lee, M. Woo, and D. Brumley, “Native x86 decom- pilation using semantics-preserving structural analysis and iterative control-flow structuring,” in Proceedings of the USENIX Security Symposium, 2013

work page 2013
[4]

Dire: A neural approach to decompiled iden- tifier naming,

J. Lacomis, P. Yin, E. Schwartz, M. Allamanis, C. Le Goues, G. Neu- big, and B. Vasilescu, “Dire: A neural approach to decompiled iden- tifier naming,” in IEEE/ACM International Conference on Automated Software Engineering. IEEE, 2019, pp. 628–639

work page 2019
[5]

Augmenting decompiler output with learned variable names and types,

Q. Chen, J. Lacomis, E. J. Schwartz, C. Le Goues, G. Neubig, and B. Vasilescu, “Augmenting decompiler output with learned variable names and types,” inProceedings of the USENIX Security Symposium, 2022, pp. 4327–4343

work page 2022
[6]

”len or index or count, anything but v1

K. K. Pal, A. P. Bajaj, P. Banerjee, A. Dutcher, M. Nakamura, Z. L. Basque, H. Gupta, S. A. Sawant, U. Anantheswaran, Y . Shoshi- taishvili, A. Doup ´e, C. Baral, and R. Wang, “”len or index or count, anything but v1”: Predicting variable names in decompilation output with transfer learning,” in Proceedings of the IEEE Symposium on Security and Privacy , 2024

work page 2024
[7]

Unleashing the power of generative model in recovering variable names from stripped binary,

X. Xu, Z. Zhang, Z. Su, Z. Huang, S. Feng, Y . Ye, N. Jiang, D. Xie, S. Cheng, L. Tan, and X. Zhang, “Unleashing the power of generative model in recovering variable names from stripped binary,” in Pro- ceedings of the Network and Distributed System Security Symposium , 2025

work page 2025
[8]

TypeForge: Synthesizing and selecting best-fit composite data types for stripped binaries,

Y . Wang, R. Liang, Y . Li, P. Hu, K. Chen, and B. Zhang, “TypeForge: Synthesizing and selecting best-fit composite data types for stripped binaries,” in Proceedings of the IEEE Symposium on Security and Privacy, 2025

work page 2025
[9]

TYGR: type inference on stripped binaries using graph neural networks,

C. Zhu, Z. Li, A. Xue, A. P. Bajaj, W. Gibbs, Y . Liu, R. Alur, T. Bao, H. Dai, A. Doup ´e, M. Naik, Y . Shoshitaishvili, R. Wang, and A. Machiry, “TYGR: type inference on stripped binaries using graph neural networks,” in Proceedings of the USENIX Security Symposium, 2024

work page 2024
[10]

ReSym: Harnessing llms to recover variable and data structure symbols from stripped binaries,

D. Xie, Z. Zhang, N. Jiang, X. Xu, L. Tan, and X. Zhang, “ReSym: Harnessing llms to recover variable and data structure symbols from stripped binaries,” in Proceedings of the ACM Conference on Com- puter and Communications Security , 2024

work page 2024
[11]

Slade: A portable small language model decompiler for optimized assembly,

J. Armengol-Estap ´e, J. Woodruff, C. Cummins, and M. F. O’Boyle, “Slade: A portable small language model decompiler for optimized assembly,” in 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) . IEEE, 2024, pp. 67–80

work page 2024
[12]

Using recurrent neural net- works for decompilation,

D. S. Katz, J. Ruchti, and E. Schulte, “Using recurrent neural net- works for decompilation,” in2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER) . IEEE, 2018, pp. 346–356

work page 2018
[13]

Llm4decompile: Decom- piling binary code with large language models,

H. Tan, Q. Luo, J. Li, and Y . Zhang, “Llm4decompile: Decom- piling binary code with large language models,” arXiv preprint arXiv:2403.05286, 2024

work page arXiv 2024
[14]

Idioms: Neural decompilation with joint code and type prediction,

L. Dramko, C. Le Goues, and E. J. Schwartz, “Idioms: Neural decompilation with joint code and type prediction,” in Proceedings of the Network and Distributed System Security Symposium , 2026

work page 2026
[15]

Static single assignment form for machine code,

M. Van Emmerik, “Static single assignment form for machine code,” Ph.D. dissertation, The University of Queensland, Brisbane, Australia,

work page
[16]

Available: https://espace.library.uq.edu.au/view/UQ: 158682

[Online]. Available: https://espace.library.uq.edu.au/view/UQ: 158682

work page
[17]

A learning algorithm for boltzmann machines,

D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, “A learning algorithm for boltzmann machines,” Cognitive science, vol. 9, no. 1, pp. 147–169, 1985

work page 1985
[18]

Controlling Linguistic Style Aspects in Neural Language Generation,

J. Ficler and Y . Goldberg, “Controlling linguistic style aspects in neural language generation,” arXiv preprint arXiv:1707.02633, 2017

work page arXiv 2017
[19]

On calibration of modern neural networks,

C. Guo, G. Pleiss, Y . Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International conference on machine learning. PMLR, 2017, pp. 1321–1330

work page 2017
[20]

Exebench: an ml-scale dataset of executable c functions,

J. Armengol-Estap ´e, J. Woodruff, A. Brauckmann, J. W. d. S. Ma- galhaes, and M. F. O’Boyle, “Exebench: an ml-scale dataset of executable c functions,” in Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming , 2022, pp. 50– 59

work page 2022
[21]

Type inference for c: Applications to the static analysis of incom- plete programs,

L. T. Melo, R. G. Ribeiro, B. C. Guimar ˜aes, and F. M. Q. Pereira, “Type inference for c: Applications to the static analysis of incom- plete programs,” ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 42, no. 3, pp. 1–71, 2020

work page 2020
[22]

Anghabench: A suite with one million compilable c benchmarks for code-size reduction,

A. F. Da Silva, B. C. Kind, J. W. de Souza Magalh ˜aes, J. N. Rocha, B. C. F. Guimaraes, and F. M. Q. Pereira, “Anghabench: A suite with one million compilable c benchmarks for code-size reduction,” in 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2021, pp. 378–390

work page 2021
[23]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[24]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al. , “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking,

A. Wei, J. Cao, R. Li, H. Chen, Y . Zhang, Z. Wang, Y . Liu, T. S. Teixeira, D. Yang, K. Wang et al., “Equibench: Benchmarking large language models’ reasoning about program semantics via equivalence checking,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , 2025, pp. 33 856–33 869

work page 2025
[26]

Extracting compiler provenance from program binaries,

N. E. Rosenblum, B. P. Miller, and X. Zhu, “Extracting compiler provenance from program binaries,” in Proceedings of the ACM Workshop on Program Analysis for Software Tools and Engineering , 2010

work page 2010
[27]

Improving security tasks using compiler provenance information recovered at the binary-level,

Y . Du, O. Alrawi, K. Snow, M. Antonakakis, and F. Monrose, “Improving security tasks using compiler provenance information recovered at the binary-level,” in Proceedings of the ACM Conference on Computer and Communications Security , 2023

work page 2023
[28]

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

R. T. McCoy, E. Pavlick, and T. Linzen, “Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference,” arXiv preprint arXiv:1902.01007, 2019

work page Pith review arXiv 1902
[29]

Shortcut learning in deep neural networks,

R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, 2020

work page 2020
[30]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Y . Wu, M. Schuster, Z. Chen, Q. V . Le, M. Norouzi, W. Macherey, M. Krikun, Y . Cao, Q. Gao, K. Macherey et al. , “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[31]

A theory of learning from different domains,

S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan, “A theory of learning from different domains,” Ma- chine learning, vol. 79, no. 1, pp. 151–175, 2010

work page 2010
[32]

Gonzalez, and Ion Stoica

S. Yang, W.-L. Chiang, L. Zheng, J. E. Gonzalez, and I. Stoica, “Rethinking benchmark and contamination for language models with rephrased samples,” arXiv preprint arXiv:2311.04850 , 2023

work page arXiv 2023
[33]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

X. B. Peng, A. Kumar, G. Zhang, and S. Levine, “Advantage-weighted regression: Simple and scalable off-policy reinforcement learning,” arXiv preprint arXiv:1910.00177 , 2019. 14

work page internal anchor Pith review Pith/arXiv arXiv 1910
[34]

Decision transformer: Reinforcement learning via sequence modeling,

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” Advances in neural information processing systems , vol. 34, pp. 15 084–15 097, 2021

work page 2021
[35]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in neural information processing systems, vol. 36, pp. 53 728–53 741, 2023

work page 2023
[36]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[37]

Learning to summarize with human feedback,

N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,” Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020

work page 2020
[38]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu et al. , “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[39]

Reverse compilation techniques,

C. Cifuentes, “Reverse compilation techniques,” Ph.D. dissertation, Queensland University of Technology, Brisbane, Australia, 1994

work page 1994
[40]

Type inference on executables,

J. Caballero and Z. Lin, “Type inference on executables,” ACM Comput. Surv. , vol. 48, no. 4, May 2016. [Online]. Available: https://doi.org/10.1145/2896499

work page doi:10.1145/2896499 2016
[41]

Type-based decompilation,

A. Mycroft, “Type-based decompilation,” in European Symposium on Programming, Mar. 1999

work page 1999
[42]

TIE: Principled reverse engineering of types in binary programs,

J. Lee, T. Avgerinos, and D. Brumley, “TIE: Principled reverse engineering of types in binary programs,” in Proceedings of the Network and Distributed System Security Symposium , Feb. 2011

work page 2011
[43]

Osprey: Recovery of variable and data structure via probabilistic analysis for stripped binary,

Z. Zhang, Y . Ye, W. You, G. Tao, W.-c. Lee, Y . Kwon, Y . Aafer, and X. Zhang, “Osprey: Recovery of variable and data structure via probabilistic analysis for stripped binary,” in Proceedings of the IEEE Symposium on Security and Privacy , 2021

work page 2021
[44]

Polymorphic type inference for machine code,

M. Noonan, A. Loginov, and D. Cok, “Polymorphic type inference for machine code,” SIGPLAN Not. , vol. 51, no. 6, p. 27–41, Jun

work page
[45]

Available: https://doi.org/10.1145/2980983.2908119

[Online]. Available: https://doi.org/10.1145/2980983.2908119

work page doi:10.1145/2980983.2908119
[46]

Binsub: The simple essence of polymorphic type inference for machine code,

I. Smith, “Binsub: The simple essence of polymorphic type inference for machine code,” in Static Analysis , R. Giacobazzi and A. Gorla, Eds. Cham: Springer Nature Switzerland, 2025, pp. 425–450

work page 2025
[47]

{TRex}: Practical type re- construction for binary code,

J. Bosamiya, M. Woo, and B. Parno, “ {TRex}: Practical type re- construction for binary code,” in 34th USENIX Security Symposium (USENIX Security 25) , 2025, pp. 6897–6915

work page 2025
[48]

https://ghidra-sre.org/

National Security Agency, “Ghidra,” Software reverse engineering framework. https://ghidra-sre.org/

work page
[49]

Hex-rays decompiler,

Hex-Rays, “Hex-rays decompiler,” Commercial decompiler. https:// hex-rays.com/

work page
[50]

Binary ninja,

Vector 35, “Binary ninja,” Interactive reverse engineering platform. https://binary.ninja/

work page
[51]

Decompilers and beyond,

I. Guilfanov, “Decompilers and beyond,” in BlackHat USA, 2008

work page 2008
[52]

Beyond the c: Retargetable de- compilation using neural machine translation,

I. Hosseini and B. Dolan-Gavitt, “Beyond the c: Retargetable de- compilation using neural machine translation,” in Proceedings of the Workshop on Binary Analysis Research , 2022

work page 2022
[53]

Decomperson: How humans decompile and what we can learn from it,

K. Burk, F. Pagani, C. Kruegel, and G. Vigna, “Decomperson: How humans decompile and what we can learn from it,” in Proceedings of the USENIX Security Symposium , 2022

work page 2022
[54]

decomp.me,

decomp.me contributors, “decomp.me,” Collaborative decompilation platform. https://decomp.me/

work page
[55]

Evolving exact decompilation,

E. Schulte, J. Ruchti, M. Noonan, D. Ciarletta, and A. Loginov, “Evolving exact decompilation,” in Proceedings of the Workshop on Binary Analysis Research , 2018

work page 2018
[56]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling model parameters,”arXiv preprint arXiv:2408.03314, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[57]

Training Verifiers to Solve Math Word Problems

K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168 , 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[58]

Competition- level code generation with alphacode,

Y . Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition- level code generation with alphacode,” Science, vol. 378, no. 6624, pp. 1092–1097, 2022. 15

work page 2022