arXiv: 2605.21615 · v1 · pith:SXZ7DJHGnew · submitted 2026-05-20 · 💻 cs.CR · cs.LG · cs.SE

ASSEMBLAGE-DEEPHISTORY: A Cross-Build Binary Dataset with Temporal Coverage

Pith reviewed 2026-05-22 09:36 UTC · model grok-4.3

classification 💻 cs.CR · cs.LG · cs.SE
keywords binary dataset · vulnerability detection · cross-build analysis · software history · CVE labeling · machine learning security

The pith

ASSEMBLAGE-DEEPHISTORY provides a single database linking 73,610 binaries to their source code, compilation details, historical versions, and CVE-labeled vulnerable functions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents a new dataset that brings together binaries compiled in many different ways, from different versions of software over time, and marked with known security issues. Existing collections usually miss at least one of these elements, making it hard to study how binaries change or how detection tools hold up across variations. A reader might care because this setup lets researchers test whether AI models truly understand binary vulnerabilities or just memorize patterns from specific builds. The dataset stores all the context as searchable metadata so one can query across builds and history easily. Analyses using large language models, embedding comparisons, and statistical regression illustrate how the structure supports practical work on binary similarity and vulnerability reasoning.

Core claim

The paper establishes ASSEMBLAGE-DEEPHISTORY as a consolidated dataset of 73,610 binaries from 248 open-source projects. These binaries come from GCC, Clang, and MSVC compilers at various optimization levels on Linux and Windows, including multi-year historical builds. Each entry connects to its source code, functions, debug information, other build variants, past versions, and functions known to be vulnerable.

What carries the argument

The queryable database structure that treats compilation context, source code, vulnerable functions, and package version as first-class metadata for every binary.
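
As a rough illustration of what "first-class metadata" could look like in practice, the sketch below builds a tiny in-memory database with Python's standard `sqlite3` module. The table names, column names, sample rows, and the placeholder CVE identifier are all hypothetical; the dataset's actual schema is not described here, so this only shows the kind of cross-build query such a structure would allow.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the actual ASSEMBLAGE-DEEPHISTORY layout.
SCHEMA = """
CREATE TABLE binaries (
    id        INTEGER PRIMARY KEY,
    package   TEXT NOT NULL,
    version   TEXT NOT NULL,
    compiler  TEXT NOT NULL,   -- e.g. GCC, Clang, MSVC
    opt_level TEXT NOT NULL,   -- e.g. O0, O2
    os        TEXT NOT NULL    -- e.g. Linux, Windows
);
CREATE TABLE vulnerable_functions (
    binary_id INTEGER NOT NULL REFERENCES binaries(id),
    function  TEXT NOT NULL,
    cve_id    TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO binaries VALUES (?, ?, ?, ?, ?, ?)",
        [
            (1, "libdemo", "1.0", "GCC", "O0", "Linux"),
            (2, "libdemo", "1.0", "Clang", "O2", "Linux"),
            (3, "libdemo", "1.1", "MSVC", "O2", "Windows"),
        ],
    )
    # "CVE-0000-0001" is a placeholder identifier, not a real CVE.
    conn.executemany(
        "INSERT INTO vulnerable_functions VALUES (?, ?, ?)",
        [
            (1, "parse_header", "CVE-0000-0001"),
            (2, "parse_header", "CVE-0000-0001"),
        ],
    )
    return conn


def builds_affected_by(conn, cve_id):
    """Return (id, compiler, opt_level, os) for every build labeled with cve_id."""
    rows = conn.execute(
        """
        SELECT b.id, b.compiler, b.opt_level, b.os
        FROM binaries AS b
        JOIN vulnerable_functions AS v ON v.binary_id = b.id
        WHERE v.cve_id = ?
        ORDER BY b.id
        """,
        (cve_id,),
    )
    return rows.fetchall()


conn = build_demo_db()
affected = builds_affected_by(conn, "CVE-0000-0001")
print(affected)
```

Keeping the build context in ordinary columns means one join answers "which build variants carry this vulnerable function", which is the kind of query the review says single-axis corpora cannot express.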

If this is right

  • LLMs can be tested in stages for recognizing vulnerabilities, using strategy guidance, and transferring detection across different builds.
  • Embedding methods like MalConv and jTrans can be compared on how well they group binaries from the same package versions.
  • Binary similarity can be broken down into effects from time between versions, code changes, and commit activity using Bayesian methods.
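
The embedding comparison in the second bullet can be pictured with a small sketch: given one vector per package version, average the pairwise cosine similarity within each package. The vectors below are toy values, not MalConv or jTrans outputs, and the function names are hypothetical. Packages with a single version are skipped, on the assumption that this matches the "≥ 2-version subset" mentioned in the Figure 3 caption.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_within_package_similarity(embeddings):
    """Map {package: {version: vector}} to {package: mean pairwise cosine}."""
    result = {}
    for package, versions in embeddings.items():
        pairs = list(combinations(sorted(versions), 2))
        if not pairs:
            continue  # a package needs at least two versions to form a pair
        sims = [cosine_similarity(versions[a], versions[b]) for a, b in pairs]
        result[package] = sum(sims) / len(sims)
    return result


# Toy embeddings (assumed values for illustration only).
toy = {
    "pkg_a": {"1.0": [1.0, 0.0], "1.1": [1.0, 0.0], "2.0": [0.0, 1.0]},
    "pkg_b": {"0.9": [3.0, 4.0]},  # single version, so it is skipped
}
scores = mean_within_package_similarity(toy)
print(scores)
```

A higher mean indicates that an embedding space places versions of the same package close together; the same loop could be run once per embedding method to compare them.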

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This structure could let developers train more robust vulnerability detectors that ignore irrelevant build differences.
  • Future work might track how specific vulnerabilities appear and disappear across software releases using the historical links.
  • Security researchers could use the cross-platform builds to study compiler-specific weaknesses in a controlled way.

Load-bearing premise

The three provided analyses sufficiently prove the dataset's value for practical tasks without requiring further large-scale tests or outside benchmarks.

What would settle it

A demonstration that the LLM benchmark results do not reflect true reasoning or that the clustering and regression fail to distinguish meaningful patterns would undermine the dataset's claimed utility.

Figures

Figures reproduced from arXiv: 2605.21615 by Chang Liu, Edward Raff, James Holt, Kristopher Micinski, Nicolò Altamura, Noah Fleischmann.

Figure 1. Three-Stage CVE Evaluation Design.
  [Paper text captured with the caption: "… 25% of the resulting records to verify that each CVE matches the correct library and version in our dataset (manually inspected CVE IDs available in appendix). For each CVE, we chose a reference binary with lowest optimization and grouped other affected binaries to one of five Diff categories: Optimization, Compiler, OS, Version and All (per-category counts in the appendix…"]
Figure 2. Cross-build transfer based on Qwen-3.6 agent. Each panel plots Hit@…
Figure 3. Package-level binary similarity on ASSEMBLAGE-DEEPHISTORY's ≥ 2-version subset (ELF + PE combined). From left to right: MalConv embedding cosine similarity, PE only; MalConv embedding cosine similarity, all-packages mean; jTrans embedding cosine similarity, all-packages mean; TLSH fuzzy-hash similarity, all-packages mean.
  [Plot axis captured with the caption: "Logit impact (95% HDI)"; terms: file change, commits, days, bias…]
Figure 4. Global coefficient posterior means and 95% HDIs for MalConv cosine similarity, jTrans…
Figure 5. Eval 3 cross-build transfer comparison.
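
The text captured with Figure 1 describes assigning each affected binary to one of five Diff categories relative to a reference build. The exact assignment rule is not visible here, so the sketch below rests on a stated assumption: if exactly one build axis differs from the reference, that axis names the category, and if several differ, the category is "All". The dictionary keys and sample builds are hypothetical.

```python
from collections import Counter

# Build axes compared against the reference binary (key names are assumptions).
AXES = ("optimization", "compiler", "os", "version")
LABELS = {
    "optimization": "Optimization",
    "compiler": "Compiler",
    "os": "OS",
    "version": "Version",
}


def diff_category(reference, other):
    """Assumed rule: one differing axis names the category; several give 'All'.

    Returns None when the two builds agree on every axis.
    """
    differing = [axis for axis in AXES if reference[axis] != other[axis]]
    if not differing:
        return None
    if len(differing) == 1:
        return LABELS[differing[0]]
    return "All"


# Reference build with the lowest optimization level, as the text describes.
reference = {"optimization": "O0", "compiler": "GCC", "os": "Linux", "version": "1.0"}
others = [
    {"optimization": "O2", "compiler": "GCC", "os": "Linux", "version": "1.0"},
    {"optimization": "O0", "compiler": "Clang", "os": "Linux", "version": "1.0"},
    {"optimization": "O2", "compiler": "MSVC", "os": "Windows", "version": "1.1"},
]
counts = Counter(diff_category(reference, build) for build in others)
print(counts)
```

Grouping by the axis that changed lets a transfer result be attributed to one source of variation at a time, which appears to be the purpose of the categories.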
Abstract

Existing binary corpora typically capture only one or two axes of binary variation: they either provide cross-compiler builds without a temporal axis, or CVE labels for single-build binaries. None combine cross-build diversity, cross-version history, and CVE labels into a queryable structure. We present ASSEMBLAGE-DEEPHISTORY, which consolidates these dimensions into a unified framework where every binary's compilation context, source code, vulnerable functions, and package version are stored as first-class metadata. ASSEMBLAGE-DEEPHISTORY comprises 73,610 binaries spanning 248 open-source projects, compiled across GCC, Clang, and MSVC at multiple optimization levels on Linux and Windows, with multi-year historical builds. Each binary is indexed in a database that links it to its source code, functions, debug info, variant builds, historical versions, and vulnerable functions. Three analyses demonstrate this structure's value: (1) a three-stage LLM benchmark (recognition, strategy-guided detection, and cross-build transfer) to test whether LLMs reason about binary vulnerabilities or pattern-match on build-specific artifacts; (2) a comparison of MalConv embeddings, jTrans function embeddings, and TLSH fuzzy hashes quantifying how same-package versions cluster in each space; and (3) a Bayesian regression decomposing binary similarity into contributions from temporal distance, file changes, and commits.
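
One plausible reading of analysis (3), offered as a sketch and not as the paper's stated model: the figure text labels the coefficient plot "Logit impact (95% HDI)" with terms for file change, commits, days, and a bias, which suggests a linear model on the logit scale. The symbols below are assumptions introduced for illustration, as is the premise that the similarity score $s_{ij}$ between versions $i$ and $j$ is scaled to lie strictly between 0 and 1. Priors and any per-package hierarchy are not visible here and are left out.

```latex
\operatorname{logit}(s_{ij})
  = \log\frac{s_{ij}}{1 - s_{ij}}
  = \beta_{0}
  + \beta_{\text{days}}\, d_{ij}
  + \beta_{\text{file}}\, f_{ij}
  + \beta_{\text{commits}}\, c_{ij}
```

Here $d_{ij}$ would be the temporal distance between the two versions, $f_{ij}$ the amount of file change, $c_{ij}$ the number of commits, and $\beta_{0}$ the bias term. Under this reading, each coefficient's posterior mean and 95% HDI describe how strongly that factor shifts similarity on the logit scale.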

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents ASSEMBLAGE-DEEPHISTORY, a dataset of 73,610 binaries spanning 248 open-source projects. It unifies cross-compiler builds (GCC, Clang, MSVC at multiple optimization levels on Linux and Windows), multi-year historical versions, and CVE labels into a single queryable database. Every binary is linked as first-class metadata to its source code, functions, debug information, variant builds, historical versions, and vulnerable functions. Value is shown via three internal analyses: a three-stage LLM benchmark (recognition, strategy-guided detection, cross-build transfer), embedding clustering comparisons (MalConv, jTrans, TLSH) on same-package versions, and Bayesian regression decomposing similarity into temporal distance, file changes, and commit factors.

Significance. If the dataset construction details and analysis results hold, the work supplies a useful resource for binary vulnerability research by consolidating axes of variation previously available only in isolation. The database indexing of compilation context, source links, and CVE labels is a concrete strength that could support new queries. No machine-checked proofs or parameter-free derivations are present, but the reproducible corpus structure itself is a positive contribution for the field.

major comments (1)
  1. [Analyses section, following the dataset construction] The three demonstrations remain entirely internal to the new corpus and quantify structure (e.g., clustering behavior or factor decomposition) without a controlled external comparison showing measurable gains on downstream tasks, such as cross-build vulnerability transfer or historical CVE localization, relative to existing single-axis corpora. This leaves the central claim, that the unified metadata framework enables new reasoning capabilities, resting on an unverified assumption.
minor comments (2)
  1. [Dataset description] Clarify the exact number of variants per project and the distribution across compilers/optimizations in the dataset statistics table; current high-level aggregates make reproducibility checks harder.
  2. [Analysis 1 and Analysis 3] Add error bars or ablation details to the LLM benchmark results and the Bayesian regression coefficients to strengthen the quantitative claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review of our manuscript on ASSEMBLAGE-DEEPHISTORY. We address the single major comment below and indicate where revisions have been made to the manuscript.

Point-by-point responses
  1. Referee: [Analyses section, following the dataset construction] The three demonstrations remain entirely internal to the new corpus and quantify structure (e.g., clustering behavior or factor decomposition) without a controlled external comparison showing measurable gains on downstream tasks, such as cross-build vulnerability transfer or historical CVE localization, relative to existing single-axis corpora. This leaves the central claim, that the unified metadata framework enables new reasoning capabilities, resting on an unverified assumption.

    Authors: We appreciate the referee's observation that the three analyses are conducted internally to the corpus. The intent of these demonstrations is to illustrate the novel analytical capabilities unlocked by unifying cross-build, temporal, and CVE metadata in a single queryable structure, capabilities that cannot be exercised on existing single-axis corpora. For instance, the LLM cross-build transfer stage directly tests whether models exploit build-specific artifacts, which requires the multi-compiler and multi-version axes we provide. The embedding clustering and Bayesian regression similarly decompose effects across temporal distance and build variants in ways prior datasets do not support. We acknowledge, however, that explicit head-to-head performance gains on downstream tasks such as vulnerability detection accuracy would provide additional external validation. In the revised manuscript we have added a dedicated limitations and future-work subsection that (a) contrasts the query expressiveness of ASSEMBLAGE-DEEPHISTORY with prior corpora and (b) outlines controlled external benchmarks that the community can now perform using the released dataset. This revision clarifies the scope of our current claims while preserving the paper's focus on the dataset itself.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: dataset release with internal utility analyses remains self-contained

The paper presents ASSEMBLAGE-DEEPHISTORY as a new corpus that unifies cross-build, temporal, and CVE metadata, then illustrates its structure via three analyses performed directly on the released binaries (LLM tasks, embedding clustering, Bayesian regression). These steps quantify internal properties of the corpus but introduce no equations, fitted parameters renamed as predictions, or self-citation chains that reduce the dataset claim to its own inputs. The central contribution is the construction and indexing of the data itself; the analyses serve as descriptive benchmarks rather than derivations whose outputs are forced by construction. No load-bearing uniqueness theorems or ansatzes from prior author work are invoked to justify the framework.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution rests on standard assumptions about accurate vulnerability labeling and representative compilation settings rather than new invented entities or fitted parameters.

axioms (2)
  • domain assumption: Compilation contexts using GCC, Clang, and MSVC at multiple optimization levels on Linux and Windows accurately capture real-world binary variation.
    Invoked when describing the 73,610 binaries spanning multiple compilers and platforms.
  • domain assumption: Vulnerable functions can be reliably identified and linked to binaries via debug info and source code.
    Required for the CVE labels to serve as ground truth in the LLM benchmark and other analyses.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · 4 internal anchors

  1. [1]

    https: / / github.com/nationalsecurityagency/ghidra

    National Security Agency.Ghidra Software Reverse Engineering Framework. https: / / github.com/nationalsecurityagency/ghidra. accessed 2026-05-06. 2019

  2. [2]

    SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection

    Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, and Song Wang. SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection. 2025.URL: https://arxiv.org/abs/2505.19828

  3. [3]

    Assessing the Effectiveness of the Tigress Obfuscator Against MOPSA and BinaryNinja

    Nicolò Altamura, Enrico Bragastini, Marco Campion, and Mila Dalla Preda. “Assessing the Effectiveness of the Tigress Obfuscator Against MOPSA and BinaryNinja”. In:Proceedings of the 2025 Workshop on Research on Offensive and Defensive Techniques in the Context of Man At The End (MATE) Attacks. 2025.URL:https://doi.org/10.1145/3733817.3762702

  4. [4]

    EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models

    H. Anderson and Phil Roth. “EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models”. In:ArXiv(2018).URL: https://api.semanticscholar.org/ CorpusID:4888440

  5. [5]

    Apple Newsroom

    Apple.Apple debuts M5 Pro and M5 Max to supercharge the most demanding pro work- flows. Apple Newsroom. Accessed: 2026-05-06. 2026.URL: https://www.apple.com/ newsroom/2026/03/apple- debuts- m5- pro- and- m5- max- to- supercharge- the- most-demanding-pro-workflows/

  6. [6]

    BinPool: A Dataset of Vulnerabilities for Binary Security Analysis

    Sima Arasteh, Georgios Nikitopoulos, Wei-Cheng Wu, Nicolaas Weideman, Aaron Portnoy, Mukund Raghothaman, and Christophe Hauser. “BinPool: A Dataset of Vulnerabilities for Binary Security Analysis”. In:Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 2025

  7. [7]

    Polyglot and Distributed Software Repository Mining with Crossflow

    Konstantinos Barmpis, Patrick Neubauer, Jonathan Co, Dimitris Kolovos, Nicholas Matragkas, and Richard F. Paige. “Polyglot and Distributed Software Repository Mining with Crossflow”. In:Proceedings of the 17th International Conference on Mining Software Repositories. 2020. URL:https://doi.org/10.1145/3379597.3387481

  8. [8]

    Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation

    Zion Leonahenahe Basque, Ati Priya Bajaj, Wil Gibbs, Jude O’Kain, Derron Miao, Tiffany Bao, Adam Doupé, Yan Shoshitaishvili, and Ruoyu Wang. “Ahoy SAILR! There is No Need to DREAM of C: A Compiler-Aware Structuring Algorithm for Binary Decompilation”. In: 33rd USENIX Security Symposium (USENIX Security 24). 2024.URL: https://www.usenix. org/conference/use...

  9. [9]

    CVEfixes: automated collection of vulner- abilities and their fixes from open-source software

    Guru Bhandari, Amara Naseer, and Leon Moonen. “CVEfixes: automated collection of vulner- abilities and their fixes from open-source software”. In:Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering. 2021.URL: http://dx.doi.org/10.1145/3475960.3475985

  10. [10]

    Syntia: Synthe- sizing the semantics of obfuscated code

    Tim Blazytko, Moritz Contag, Cornelius Aschermann, and Thorsten Holz. “Syntia: Synthe- sizing the semantics of obfuscated code”. In:26th USENIX Security Symposium (USENIX Security 17). 2017. 10

  11. [11]

    The tigress c diversifier/obfuscator

    Christian Collberg, Sam Martin, Jonathan Myers, Bill Zimmerman, Petr Krajca, Gabriel Kerneis, Saumya Debray, and Babak Yadegari. “The tigress c diversifier/obfuscator”. In: Retrieved August(2015)

  12. [12]

    Christian Collberg, Clark Thomborson, and Douglas Low.A taxonomy of obfuscating transfor- mations. 1997

  13. [13]

    BinBench: a benchmark for x64 portable operating system interface binary function represen- tations

    Francesca Console, Giuseppe D’Aquanno, Giuseppe Antonio Di Luna, and Leonardo Querzoni. “BinBench: a benchmark for x64 portable operating system interface binary function represen- tations”. In:PeerJ Computer Science(2023).URL: https://api.semanticscholar.org/ CorpusID:259029804

  14. [14]

    EM- BERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis

    Dragos Georgian Corlatescu, Alexandru Dinu, Mihaela Gaman, and Paul Sumedrea. “EM- BERSim: A Large-Scale Databank for Boosting Similarity Search in Malware Analysis”. In: ArXiv(2023).URL:https://api.semanticscholar.org/CorpusID:263608542

  15. [15]

    RISC-V Instruction Set Architecture Extensions: A Survey

    Enfang Cui, Tianzheng Li, and Qian Wei. “RISC-V Instruction Set Architecture Extensions: A Survey”. In:IEEE Access(2023)

  16. [16]

    https://www.cve.org/

    CVE Program.Common Vulnerabilities and Exposures (CVE). https://www.cve.org/ . Accessed: 2026-05-06

  17. [17]

    https://ai.google.dev/gemma/docs/core/ model_card_4

    Google DeepMind.Gemma 4 model card. https://ai.google.dev/gemma/docs/core/ model_card_4. accessed 2026-05-20. 2026

  18. [18]

    Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization

    Steven HH Ding, Benjamin CM Fung, and Philippe Charland. “Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization”. In:2019 ieee symposium on security and privacy (sp). 2019

  19. [19]

    Vulnerability detection with code language models: How far are we?

    Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, and Yizheng Chen. “Vulnerability Detection with Code Language Models: How Far Are We?” In:arXiv preprint arXiv:2403.18624(2024)

  20. [20]

    LibvDiff: Library Version Difference Guided OSS Version Identification in Binaries

    Chaopeng Dong, Siyuan Li, Shouguo Yang, Yang Xiao, Yongpan Wang, Hong Li, Zhi Li, and Limin Sun. “LibvDiff: Library Version Difference Guided OSS Version Identification in Binaries”. In:Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 2024.URL:https://doi.org/10.1145/3597503.3623336

  21. [21]

    Schwartz.Idioms: Neural Decompilation With Joint Code and Type Definition Prediction

    Luke Dramko, Claire Le Goues, and Edward J. Schwartz.Idioms: Neural Decompilation With Joint Code and Type Definition Prediction. 2025.URL: https://arxiv.org/abs/2502. 04536

  22. [22]

    Identifying Open- Source License Violation and 1-day Security Risk at Large Scale

    Ruian Duan, Ashish Bijlani, Meng Xu, Taesoo Kim, and Wenke Lee. “Identifying Open- Source License Violation and 1-day Security Risk at Large Scale”. In:Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. 2017.URL: https://doi.org/10.1145/3133956.3134048

  23. [23]

    DeepBinDiff: Learning Program- Wide Code Representations for Binary Diffing

    Yue Duan, Xuezixiang Li, Jinghan Wang, and Heng Yin. “DeepBinDiff: Learning Program- Wide Code Representations for Binary Diffing”. In:27th Annual Network and Distributed Sys- tem Security Symposium, NDSS 2020, San Diego, California, USA, February 23-26, 2020. 2020. URL: https://www.ndss- symposium.org/ndss- paper/deepbindiff- learning- program-wide-code-...

  24. [24]

    A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries

    Jiahao Fan, Yi Li, Shaohua Wang, and Tien N. Nguyen. “A C/C++ Code Vulnerability Dataset with Code Changes and CVE Summaries”. In:Proceedings of the 17th International Confer- ence on Mining Software Repositories. 2020.URL: https://doi.org/10.1145/3379597. 3387501

  25. [25]

    Scalable graph-based bug search for firmware images

    Qian Feng, Rundong Zhou, Chengcheng Xu, Yao Cheng, Brian Testa, and Heng Yin. “Scalable graph-based bug search for firmware images”. In:Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 2016

  26. [26]

    Structural comparison of executable objects

    Halvar Flake. “Structural comparison of executable objects”. In:Detection of intrusions and malware & vulnerability assessment, GI SIG SIDAR workshop, DIMVA 2004. 2004

  27. [27]

    BinHunt: Automatically Finding Semantic Differences in Binary Programs

    Debin Gao, Michael K. Reiter, and Dawn Song. “BinHunt: Automatically Finding Semantic Differences in Binary Programs”. In:Information and Communications Security: 10th Interna- tional Conference, ICICS 2008 Birmingham, UK, October 20 - 22, 2008 Proceedings. 2008. URL:https://doi.org/10.1007/978-3-540-88625-9_16. 11

  28. [28]

    SigmaDiff: Semantics-Aware Deep Graph Matching for Pseudocode Diffing

    Lian Gao, Yu Qu, Sheng Yu, Yue Duan, and Heng Yin. “SigmaDiff: Semantics-Aware Deep Graph Matching for Pseudocode Diffing”. In:Proceedings 2024 Network and Distributed Sys- tem Security Symposium(2024).URL: https://api.semanticscholar.org/CorpusID: 262144278

  29. [29]

    Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)

    Andrew Gelman. “Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper)”. In:Bayesian Analysis(2006).URL: https://doi.org/ 10.1214/06-BA117A

  30. [30]

    Why We (Usually) Don’t Have to Worry About Multiple Comparisons

    Andrew Gelman, Jennifer Hill, and Masanao Yajima. “Why We (Usually) Don’t Have to Worry About Multiple Comparisons”. In:Journal of Research on Educational Effectiveness(2012). URL:https://doi.org/10.1080/19345747.2011.618213

  31. [31]

    Inference from iterative simulation using multiple sequences

    Andrew Gelman and Donald B Rubin. “Inference from iterative simulation using multiple sequences”. In:Statistical science(1992)

  32. [32]

    Accessed: 2026- 05-06

    GitHub.GitHub Advisory Database.https://github.com/advisories. Accessed: 2026- 05-06

  33. [33]

    The GHTorent dataset and tool suite

    Georgios Gousios. “The GHTorent dataset and tool suite”. In:2013 10th Working Conference on Mining Software Repositories (MSR). 2013

  34. [34]

    BinProv: Binary Code Provenance Identification without Disassembly

    Xu He, Shu Wang, Yunlong Xing, Pengbin Feng, Haining Wang, Qi Li, Songqing Chen, and Kun Sun. “BinProv: Binary Code Provenance Identification without Disassembly”. In: Proceedings of the 25th International Symposium on Research in Attacks, Intrusions and Defenses(2022).URL:https://api.semanticscholar.org/CorpusID:252910574

  35. [35]

    The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo

    Matthew D. Hoffman and Andrew Gelman. “The No-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo”. In:J. Mach. Learn. Res.(2011).URL: https: //api.semanticscholar.org/CorpusID:12948548

  36. [36]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. “RULER: What’s the real context size of your long-context language models?” In:arXiv preprint arXiv:2404.06654(2024)

  37. [37]

    2020.URL: https://github

    Zecong Hu and Jeremy Lacomis.GitHub Cloner & Compiler. 2020.URL: https://github. com/huzecong/ghcc

  38. [38]

    2025.URL:https://arxiv.org/abs/2505.22010

    Nasir Hussain, Haohan Chen, Chanh Tran, Philip Huang, Zhuohao Li, Pravir Chugh, William Chen, Ashish Kundu, and Yuan Tian.VulBinLLM: LLM-powered Vulnerability Detection for Stripped Binaries. 2025.URL:https://arxiv.org/abs/2505.22010

  39. [39]

    BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching

    Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, and Yuqun Zhang. BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching. 2024.URL:https://arxiv.org/abs/2401.11161

  40. [40]

    2025.URL:https://arxiv.org/abs/2311.13721

    Nan Jiang, Chengxiao Wang, Kevin Liu, Xiangzhe Xu, Lin Tan, Xiangyu Zhang, and Petr Babkin.Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning. 2025.URL:https://arxiv.org/abs/2311.13721

  41. [41]

    Joyce, Dev Amlani, Charles Nicholas, and Edward Raff.MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels

    Robert J. Joyce, Dev Amlani, Charles Nicholas, and Edward Raff.MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels. 2021.URL: https://arxiv.org/abs/ 2111.15031

  42. [42]

    EMBER2024 - A Benchmark Dataset for Holistic Evaluation of Malware Classifiers

    Robert J. Joyce, Gideon Miller, Phil Roth, Richard Zak, Elliott Zaresky-Williams, Hyrum Anderson, Edward Raff, and James Holt. “EMBER2024 - A Benchmark Dataset for Holistic Evaluation of Malware Classifiers”. In:Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V .2. 2025.URL: http://dx.doi.org/10.1145/ 3711896.3737431

  43. [43]

    Obfuscator-LLVM – Software Protection for the Masses

    Pascal Junod, Julien Rinaldini, Johan Wehrli, and Julie Michielin. “Obfuscator-LLVM – Software Protection for the Masses”. In:2015 IEEE/ACM 1st International Workshop on Software Protection. 2015

  44. [44]

    Revisiting Binary Code Similarity Analysis Using Interpretable Feature Engineering and Lessons Learned

    Dongkwan Kim, Eunsoo Kim, Sang Kil Cha, Sooel Son, and Yongdae Kim. “Revisiting Binary Code Similarity Analysis Using Interpretable Feature Engineering and Lessons Learned”. In: IEEE Transactions on Software Engineering(2023).URL: http://dx.doi.org/10.1109/ TSE.2022.3187689

  45. [45]

    Joxean Koret.Diaphora.https://github.com/joxeankoret/diaphora. 12

  46. [46]

    Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan

    John Kruschke. “Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan”. In: (2014)

  47. [47]

    URLhttps://openreview.net/forum?id=VTF8yNQM66

    Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, and Lingming Zhang.SEC-bench: Automated Bench- marking of LLM Agents on Real-World Software Security Tasks. 2025.URL: https://arxiv. org/abs/2506.11791

  48. [48]

    2025.URL:https://arxiv.org/abs/2506.05692

    Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, and Xinchen Gu.SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM- Generated Code. 2025.URL:https://arxiv.org/abs/2506.05692

  49. [49]

    PalmTree: Learning an Assembly Language Model for Instruction Embedding

    Xuezixiang Li, Yu Qu, and Heng Yin. “PalmTree: Learning an Assembly Language Model for Instruction Embedding”. In:Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.URL: http : / / dx . doi . org / 10 . 1145 / 3460120 . 3484587

  50. [50]

    Mining Internet-Scale Software Repositories

    Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, and Pierre Baldi. “Mining Internet-Scale Software Repositories”. In:Advances in Neural Information Processing Systems. 2007.URL: https://proceedings.neurips.cc/paper_files/paper/2007/file/ a532400ed62e772b9dc0b86f46e583ff-Paper.pdf

  51. [51]

    α Diff: Cross-Version Binary Code Similarity Detection with DNN

    Bingchang Liu, Wei Huo, Chao Zhang, Wenchao Li, Feng Li, Aihua Piao, and Wei Zou. “α Diff: Cross-Version Binary Code Similarity Detection with DNN”. In:2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). 2018

  52. [52]

    2024.URL:https://arxiv.org/abs/2405.03991

    Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, and Kristopher Micinski.Assemblage: Automatic Binary Dataset Construction for Machine Learning. 2024.URL:https://arxiv.org/abs/2405.03991

  53. [53]

    2026.URL:https://arxiv.org/abs/2603.28002

    Chang Liu, Yihao Sun, Thomas Gilray, and Kristopher Micinski.Superset Decompilation. 2026.URL:https://arxiv.org/abs/2603.28002

  54. [54]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. “Lost in the middle: How language models use long contexts”. In:Transac- tions of the association for computational linguistics(2024)

  55. [55]

    2026.URL: https : //arxiv.org/abs/2602.06687

    Li Lu, Yanjie Zhao, Hongzhou Rao, Kechi Zhang, and Haoyu Wang.Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models. 2026.URL: https : //arxiv.org/abs/2602.06687

  56. [56]

    How Machine Learning Is Solving the Binary Function Similarity Problem

    Andrea Marcelli, Mariano Graziano, Xabier Ugarte-Pedrero, Yanick Fratantonio, Mohamad Mansouri, and Davide Balzarotti. “How Machine Learning Is Solving the Binary Function Similarity Problem”. In:31st USENIX Security Symposium (USENIX Security 22). 2022. URL: https://www.usenix.org/conference/usenixsecurity22/presentation/ marcelli

  57. [57]

    2019.URL: https: //arxiv.org/abs/1811.05296

    Luca Massarelli, Giuseppe Antonio Di Luna, Fabio Petroni, Leonardo Querzoni, and Roberto Baldoni.SAFE: Self-Attentive Function Embeddings for Binary Similarity. 2019.URL: https: //arxiv.org/abs/1811.05296

  58. [58]

    Equation of state calculations by fast computing machines

    Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. “Equation of state calculations by fast computing machines”. In:The journal of chemical physics(1953)

  59. [59]

    Microsoft.vcpkg.https://github.com/microsoft/vcpkg. 2024

  60. [60]

    https: //nvd.nist.gov

    National Institute of Standards and Technology.National Vulnerability Database. https: //nvd.nist.gov. Accessed: 2026-05-06

  61. [61]

    MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations

    Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, and Shaohua Wang. “MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representations”. In:2024 IEEE/ACM 21st International Conference on Mining Software Repositories (MSR). 2024

  62. [62]

    TLSH–a locality sensitive hash

    Jonathan Oliver, Chun Cheng, and Yanggui Chen. “TLSH–a locality sensitive hash”. In:2013 fourth cybercrime and trustworthy computing workshop. 2013

  63. [63]

    Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro

    Du Phan, Neeraj Pradhan, and Martin Jankowiak. Composable Effects for Flexible and Accelerated Probabilistic Programming in NumPyro. 2019. URL: https://arxiv.org/abs/1912.11554

  64. [64]

    Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection

    Edward Raff, William Fleshman, Richard Zak, Hyrum S. Anderson, Bobby Filar, and Mark McLean. Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection. 2020. URL: https://arxiv.org/abs/2012.09390

  65. [65]

    CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. 2020. URL: https://arxiv.org/abs/2009.10297

  66. [66]

    Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

    Martin Riddell, Ansong Ni, and Arman Cohan. Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models. 2024. URL: https://arxiv.org/abs/2403.04811

  67. [67]

    VulZoo: A Comprehensive Vulnerability Intelligence Dataset

    Bonan Ruan, Jiahao Liu, Weibo Zhao, and Zhenkai Liang. “VulZoo: A Comprehensive Vulnerability Intelligence Dataset”. In: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 2024. URL: https://doi.org/10.1145/3691620.3695345

  68. [68]

    Symbolic deobfuscation: From virtualized code back to the original

    Jonathan Salwan, Sébastien Bardin, and Marie-Laure Potet. “Symbolic deobfuscation: From virtualized code back to the original”. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. 2018

  69. [69]

    Is Function Similarity Over-Engineered? Building a Benchmark

    Rebecca Saul, Chang Liu, Noah Fleischmann, Richard Zak, Kristopher Micinski, Edward Raff, and James Holt. “Is Function Similarity Over-Engineered? Building a Benchmark”. In: Advances in Neural Information Processing Systems. 2024

  70. [70]

    Loki: Hardening code obfuscation against automated attacks

    Moritz Schloegel, Tim Blazytko, Moritz Contag, Cornelius Aschermann, Julius Basler, Thorsten Holz, and Ali Abbasi. “Loki: Hardening code obfuscation against automated attacks”. In: 31st USENIX Security Symposium (USENIX Security 22). 2022

  71. [71]

    paper2repo: GitHub Repository Recommendation for Academic Papers

    Huajie Shao, Dachun Sun, Jiahao Wu, Zecheng Zhang, Aston Zhang, Shuochao Yao, Shengzhong Liu, Tianshi Wang, Chao Zhang, and Tarek Abdelzaher. “paper2repo: GitHub Repository Recommendation for Academic Papers”. In: Proceedings of The Web Conference 2020. 2020. URL: http://dx.doi.org/10.1145/3366423.3380145

  73. [73]

    Ubuntu One investigation: Detecting evidences on client machines

    Mohammad Behnam Shariati, Ali Dehghantanha, Ben Martini, and Kim-Kwang Raymond Choo. “Ubuntu One investigation: Detecting evidences on client machines”. In: The Cloud Security Ecosystem. 2015. URL: https://api.semanticscholar.org/CorpusID:33377904

  74. [74]

    SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis

    Yan Shoshitaishvili, Ruoyu Wang, Christopher Salls, Nick Stephens, Mario Polino, Andrew Dutcher, John Grosen, Siji Feng, Christophe Hauser, Christopher Kruegel, and Giovanni Vigna. “SOK: (State of) The Art of War: Offensive Techniques in Binary Analysis”. In: 2016 IEEE Symposium on Security and Privacy (SP). 2016

  75. [75]

    Pushan: Trace-Free Deobfuscation of Virtualization-Obfuscated Binaries

    Ashwin Sudhir, Zion Leonahenahe Basque, Wil Gibbs, Ati Priya Bajaj, Pulkit Singh Singaria, Mitchell Zakocs, Jie Hu, Moritz Schloegel, Tiffany Bao, Adam Doupe, Yan Shoshitaishvili, and Ruoyu Wang. Pushan: Trace-Free Deobfuscation of Virtualization-Obfuscated Binaries. 2026. URL: https://arxiv.org/abs/2603.18355

  76. [76]

    Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. LLM4Decompile: Decompiling Binary Code with Large Language Models. 2024

  77. [77]

    Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation

    Hanzhuo Tan, Xiaolong Tian, Hanrui Qi, Jiaming Liu, Zuchen Gao, Siyi Wang, Qi Luo, Jing Li, and Yuqun Zhang. Decompile-Bench: Million-Scale Binary-Source Function Pairs for Real-World Binary Decompilation. 2025. URL: https://arxiv.org/abs/2505.12668

  78. [78]

    LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks

    Saad Ullah, Mingji Han, Saurabh Pujar, Hammond Pearce, Ayse Coskun, and Gianluca Stringhini. LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks. 2024. URL: https://arxiv.org/abs/2312.12575

  79. [79]

    Angr - The Next Generation of Binary Analysis

    Fish Wang and Yan Shoshitaishvili. “Angr - The Next Generation of Binary Analysis”. In: 2017 IEEE Cybersecurity Development (SecDev). 2017

  80. [80]

    jTrans: Jump-Aware Transformer for Binary Code Similarity

    Hao Wang, Wenjie Qu, Gilad Katz, Wenyu Zhu, Zeyu Gao, Han Qiu, Jianwei Zhuge, and Chao Zhang. jTrans: Jump-Aware Transformer for Binary Code Similarity. 2022. URL: https://arxiv.org/abs/2205.12713

Showing first 80 references.