Understanding Binary Code Similarity for Real-World Vulnerability Detection: A Large-Scale Empirical Study
Pith reviewed 2026-06-30 09:50 UTC · model grok-4.3
The pith
Build-aware queries from real binaries raise BCSD mean reciprocal rank from 0.818 to 0.981 for firmware vulnerability detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Analysis of BCSD across diverse real firmware reveals that compilation toolchains and search space cause large performance variations; deriving queries from representative real-world binaries closes the gap and raises mean reciprocal rank from 0.818 to 0.981, while a TPL-aware two-stage search improves MRR by an additional 18.5 percent by restricting the search space.
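For readers unfamiliar with the metric, the following is a minimal sketch of how mean reciprocal rank is conventionally computed: each query contributes 1/rank of its true match, and the scores are averaged. The function names and toy rankings are illustrative assumptions, not data or code from the paper.

```python
from typing import Hashable, Sequence, Tuple


def reciprocal_rank(ranked: Sequence[Hashable], target: Hashable) -> float:
    """Return 1/rank of the first occurrence of target (1-indexed), or 0.0 if absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results: Sequence[Tuple[Sequence[Hashable], Hashable]]) -> float:
    """Average the reciprocal rank over (ranked_candidates, true_target) pairs."""
    if not results:
        raise ValueError("need at least one query")
    return sum(reciprocal_rank(ranked, target) for ranked, target in results) / len(results)


# Toy rankings with hypothetical function names (not from the paper).
queries = [
    (["f_a", "f_b", "f_c"], "f_a"),         # true match at rank 1 -> 1.0
    (["g_a", "g_b", "g_c"], "g_b"),         # true match at rank 2 -> 0.5
    (["h_a", "h_b", "h_c", "h_d"], "h_d"),  # true match at rank 4 -> 0.25
]
mrr = mean_reciprocal_rank(queries)
print(f"MRR = {mrr:.3f}")
```

Under this definition an MRR of 0.981 means the vulnerable function is ranked first for nearly every query, while 0.818 leaves room for a noticeable share of queries whose true match sits at rank 2 or lower.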
What carries the argument
The build-aware query strategy, which selects query functions from binaries compiled under conditions matching the target firmware rather than from synthetic or mismatched sources.
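The summary does not spell out how matching build conditions are identified, so the following is only a sketch of the general idea under stated assumptions: each binary carries a small record of build metadata, and the query is drawn from the candidate whose record agrees with the target on the most fields. The `BuildInfo` fields, the overlap score, and all values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Sequence


@dataclass(frozen=True)
class BuildInfo:
    """Hypothetical build metadata; the paper may use different attributes."""
    arch: str
    compiler: str
    opt_level: str


def build_overlap(a: BuildInfo, b: BuildInfo) -> int:
    """Count how many metadata fields two builds have in common."""
    return sum(getattr(a, f.name) == getattr(b, f.name) for f in fields(BuildInfo))


def pick_query_build(target: BuildInfo, candidates: Sequence[BuildInfo]) -> BuildInfo:
    """Return the candidate build closest to the target (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate build")
    return max(candidates, key=lambda c: build_overlap(target, c))


# Illustrative values only.
target = BuildInfo(arch="mips", compiler="gcc-4.8", opt_level="-Os")
candidates = [
    BuildInfo(arch="x86_64", compiler="gcc-11", opt_level="-O2"),
    BuildInfo(arch="mips", compiler="gcc-4.8", opt_level="-O2"),
    BuildInfo(arch="mips", compiler="gcc-4.8", opt_level="-Os"),
]
chosen = pick_query_build(target, candidates)
print(chosen)
```

A field-count score is the simplest possible choice; a real pipeline would probably have to infer these attributes from the binaries themselves and weight them differently.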
If this is right
- Standard BCSD benchmarks that rely on non-representative queries systematically underestimate field performance.
- Incorporating knowledge of third-party libraries to limit search space yields consistent accuracy gains across different detection methods.
- Matching query and target binaries on compilation toolchain and build settings is required for reliable vulnerability ranking.
- Function size and version differences alone do not explain most observed performance drops once build awareness is added.
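The search-space point can be illustrated with a small two-stage sketch. It assumes that each firmware binary has already been labeled with the third-party library it belongs to and that functions are compared by cosine similarity over embeddings. Both are assumptions made for illustration; the binaries, function names, and vectors are toy values, not the paper's data or method.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
# binary name -> (TPL label, {function name: embedding})
Firmware = Dict[str, Tuple[str, Dict[str, Vector]]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(query_vec: Vector, query_tpl: str, firmware: Firmware) -> List[Tuple[str, str]]:
    """Stage 1 keeps binaries labeled with the query's TPL; stage 2 ranks their functions."""
    scoped = {name: funcs for name, (tpl, funcs) in firmware.items() if tpl == query_tpl}
    scored = [
        (cosine(query_vec, vec), binary, fn)
        for binary, funcs in scoped.items()
        for fn, vec in funcs.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(binary, fn) for _, binary, fn in scored]


# Toy firmware image (hypothetical embeddings).
firmware: Firmware = {
    "libssl.so": ("openssl", {"vuln_fn": (0.9, 0.1), "other_fn": (0.2, 0.8)}),
    "libz.so": ("zlib", {"lookalike_fn": (0.95, 0.05)}),
}
results = two_stage_search((1.0, 0.0), "openssl", firmware)
print(results)
```

In this toy image the zlib function is the closest embedding to the query overall, so a flat search would rank it first; restricting stage 2 to the matching library removes it from contention.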
Where Pith is reading between the lines
- The same query-selection principle could be tested on other binary analysis tasks such as malware classification or patch identification.
- Detection pipelines might benefit from explicitly encoding build metadata as an auxiliary input rather than treating it as noise.
- Future large-scale studies could isolate the contribution of each factor by holding the others fixed in controlled subsets of the firmware corpus.
Load-bearing premise
The collection of 60,000 firmware images from 200 vendors supplies enough variety in vulnerabilities, third-party libraries, and compilation environments to support broad conclusions about BCSD behavior.
What would settle it
Running the same evaluation protocol on a fresh set of firmware images from additional vendors and measuring whether the reported MRR gains remain above 0.95 or fall closer to the baseline of 0.818.
Original abstract
Firmware lies at the heart of IoT devices. Its development depends heavily on third-party libraries (TPLs), which greatly accelerate the process but simultaneously introduce associated vulnerabilities. Binary Code Similarity Detection (BCSD) is an effective technique for identifying vulnerabilities in firmware by comparing pairs of code segments. However, existing studies either evaluate their performance only on small-scale datasets or lack diversity in terms of vulnerabilities, TPLs, and firmware. Consequently, a comprehensive understanding of BCSD for real-world vulnerability detection remains absent. To bridge this gap, we conduct a large-scale study of vulnerability detection across 60,000 firmware images from 200 vendors using BCSD. Rather than introducing a novel model, we examine the influence of four key factors (vulnerable function versions, vulnerability search space, function sizes, and compilation toolchains) on BCSD performance. Our results reveal that these factors substantially affect performance, often by wide margins. To address this, we propose a build-aware query strategy that derives queries from representative real-world binaries, effectively closing the gap and raising the mean reciprocal rank (MRR) from 0.818 to 0.981. Furthermore, we demonstrate that a TPL-aware, two-stage search process significantly enhances accuracy, improving MRR by 18.5% by limiting the search space.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a large-scale empirical study of Binary Code Similarity Detection (BCSD) for vulnerability detection in firmware. It evaluates BCSD performance across 60,000 firmware images from 200 vendors, examining the effects of four factors (vulnerable function versions, search space, function sizes, and compilation toolchains). The authors propose a build-aware query strategy that raises MRR from 0.818 to 0.981 and a TPL-aware two-stage search that improves MRR by 18.5%.
Significance. If the dataset is representative, the work provides useful empirical insights into real-world BCSD limitations and practical mitigation strategies, addressing the diversity shortcomings of prior smaller-scale studies. The scale of the corpus is a clear strength.
major comments (2)
- [Abstract] Abstract: the motivation criticizes prior studies for insufficient diversity in vulnerabilities, TPLs, and firmware, yet supplies no quantitative evidence (vendor distribution histograms, architecture coverage, TPL frequency counts, or labeling methodology) that the 60k corpus overcomes those limitations. This directly undermines the generalizability of the reported MRR gains.
- [Abstract] Abstract and experimental description: no details are given on baseline BCSD implementations, statistical significance tests, or curation/labeling procedures for the 60k dataset. These omissions make it impossible to assess whether the 0.818→0.981 and +18.5% improvements are robust or artifactual.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and transparency in the abstract and experimental sections. We address each point below and will revise the manuscript to incorporate additional details where feasible.
- Referee: [Abstract] Abstract: the motivation criticizes prior studies for insufficient diversity in vulnerabilities, TPLs, and firmware, yet supplies no quantitative evidence (vendor distribution histograms, architecture coverage, TPL frequency counts, or labeling methodology) that the 60k corpus overcomes those limitations. This directly undermines the generalizability of the reported MRR gains.
  Authors: We agree that the abstract, due to length constraints, does not include quantitative summaries of dataset diversity. The full manuscript (Section 3) contains vendor distribution details across 200 vendors, architecture coverage (e.g., ARM, x86, MIPS), TPL frequency counts, and labeling methodology based on CVE matching and binary analysis. To strengthen the motivation and generalizability claims, we will revise the abstract to include concise quantitative evidence, such as the number of unique TPLs and architectures represented.
  revision: yes
- Referee: [Abstract] Abstract and experimental description: no details are given on baseline BCSD implementations, statistical significance tests, or curation/labeling procedures for the 60k dataset. These omissions make it impossible to assess whether the 0.818→0.981 and +18.5% improvements are robust or artifactual.
  Authors: The experimental section describes the BCSD tools and dataset construction at a high level, but we acknowledge that explicit details on baseline implementations (e.g., specific versions of tools like BinDiff or Asm2Vec), statistical significance testing for the MRR improvements, and expanded curation/labeling procedures (e.g., exact CVE-to-binary mapping steps) are not sufficiently elaborated. We will revise the experimental description to add these elements, including any applicable significance tests, to allow better assessment of robustness.
  revision: yes
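Neither the report nor the rebuttal names a specific significance test. As one possible illustration, the sketch below applies a paired bootstrap to per-query reciprocal ranks and reports a confidence interval for the mean improvement. The procedure, the parameter defaults, and the toy reciprocal ranks are all assumptions, not the authors' method or results.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    baseline_rr: Sequence[float],
    improved_rr: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean per-query reciprocal-rank difference."""
    if len(baseline_rr) != len(improved_rr) or not baseline_rr:
        raise ValueError("inputs must be non-empty and paired query-by-query")
    rng = random.Random(seed)
    diffs = [after - before for before, after in zip(baseline_rr, improved_rr)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy per-query reciprocal ranks (hypothetical).
baseline = [1.0, 0.5, 0.25, 1.0, 0.5, 0.2, 1.0, 0.5]
improved = [1.0, 1.0, 0.5, 1.0, 1.0, 0.5, 1.0, 1.0]
lower, upper = paired_bootstrap_ci(baseline, improved)
print(f"95% CI for mean RR gain: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero would suggest the gain is not a resampling artifact, though it says nothing about whether the corpus itself is representative.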
Circularity Check
No circularity: purely empirical measurements on external corpus
The paper reports an empirical large-scale study measuring BCSD performance factors (vulnerable function versions, search space, function sizes, toolchains) across 60k firmware images and then measures MRR gains from two proposed strategies (build-aware queries, TPL-aware two-stage search). These are direct experimental outcomes on held-out or representative binaries, not derivations, fitted parameters renamed as predictions, or self-citation chains. No equations, ansatzes, or uniqueness theorems appear; the MRR numbers (0.818→0.981, +18.5%) are observed deltas, not forced by construction. The representativeness concern raised by the skeptic is a validity/generalizability issue, not a circularity reduction. The work is self-contained against its own corpus benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Binary Code Similarity Detection (BCSD) is an effective technique for identifying vulnerabilities in firmware by comparing pairs of code segments.