
arxiv: 2604.19496 · v1 · submitted 2026-04-21 · 💻 cs.CR

EvoPatch-IoT: Evolution-Aware Cross-Architecture Vulnerability Retrieval and Patch-State Profiling for BusyBox-Based IoT Firmware

Pith reviewed 2026-05-10 02:28 UTC · model grok-4.3

classification 💻 cs.CR
keywords IoT firmware · BusyBox · cross-architecture retrieval · vulnerability localization · stripped binaries · patch-state profiling · binary function matching · evolution-aware analysis

The pith

EvoPatch-IoT localizes homologous vulnerable functions in stripped BusyBox IoT firmware across architectures by combining instruction features, graph statistics, geometric priors, and version history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EvoPatch-IoT as a retrieval system that matches functions between BusyBox binaries compiled for different processors even after stripping removes names and paths. It draws on anonymous instruction sequences, control-flow graph properties, binary-specific layout cues, and records of how functions evolved across past releases to rank likely matches. Evaluated on 57 versions and over a thousand architecture pairs drawn from 285 stripped binaries, the system reaches 34.56 percent weighted top-1 accuracy and 56.24 percent top-10 accuracy while shrinking the number of functions that must be checked by hand by nearly 99 percent. A related transfer test on version changes yields high ROC-AUC scores, and a proxy for a known CVE patch status reaches 82 percent accuracy across held-out architectures. The work therefore supplies concrete evidence that evolution-aware binary matching can scale vulnerability auditing for the heterogeneous, stripped firmware typical of IoT devices.

Core claim

EvoPatch-IoT demonstrates that an evolution-aware cross-architecture retrieval framework, built from anonymous instruction and context features, graph-level statistics, per-binary geometric priors, and historical function prototypes, can localize homologous functions in stripped BusyBox binaries without symbols, source paths, or version strings at query time, as shown by consistent outperformance of baselines on a benchmark covering 57 versions, 1,020 directed architecture pairs, and 285 stripped binaries.

What carries the argument

An evolution-aware cross-architecture retrieval framework that integrates anonymous instruction/context features, graph-level statistics, per-binary geometric priors, and historical function prototypes to rank candidate homologous functions for vulnerability localization.
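
To make the idea of ranking by combined evidence concrete, here is a minimal sketch of fusing per-channel similarity scores into one ranking. It is not the paper's method: the channel names, the weights, the weighted-sum rule, and the function identifiers are all hypothetical, since the source does not say how the four feature families are combined.

```python
from typing import Dict, List, Tuple

# Hypothetical channel names and weights (assumption, not taken from the paper).
WEIGHTS: Dict[str, float] = {"instr": 0.4, "graph": 0.2, "geom": 0.2, "history": 0.2}


def fuse(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of per-channel similarities; a missing channel counts as 0."""
    return sum(w * scores.get(name, 0.0) for name, w in weights.items())


def rank_candidates(
    candidates: Dict[str, Dict[str, float]], top_k: int = 10
) -> List[Tuple[str, float]]:
    """Return the top_k candidate functions, best fused score first."""
    scored = [(func_id, fuse(channels)) for func_id, channels in candidates.items()]
    # Sort by descending score; break ties by identifier for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Illustrative candidates for one query function (identifiers are made up).
candidates = {
    "sub_4010": {"instr": 0.9, "graph": 0.8, "geom": 0.7, "history": 0.6},
    "sub_4a20": {"instr": 0.5, "graph": 0.9, "geom": 0.2, "history": 0.1},
    "sub_5000": {"instr": 0.2, "graph": 0.1, "geom": 0.9},
}
ranking = rank_candidates(candidates)
```

A linear fusion is only the simplest option; a learned ranker over the same channels would fit the description equally well.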

If this is right

  • The framework reduces expected manual review effort by 98.98 percent on the 1,020 architecture-pair test set.
  • It outperforms the strongest baseline by 16.04 percentage points in weighted Hit@1 and 26.85 points in Hit@10.
  • Performance remains best on 56 of 57 versions and holds on difficult architecture pairs.
  • A version-change transfer experiment reaches a mean ROC-AUC of 0.9887.
  • A CVE-2021-42386 patch-state proxy attains 82.44 percent mean accuracy and 88.47 percent mean F1 across held-out architectures.
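
The Hit@1 and Hit@10 figures above are rank-based retrieval metrics. The sketch below shows how such numbers can be computed from the 1-based rank of the true match per query. Weighting each architecture pair by its query count is an assumption about what "weighted" means here, and the pair names are placeholders.

```python
from typing import Dict, List, Sequence


def hit_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of queries whose true match appears in the top k (1-based ranks)."""
    if not ranks:
        return 0.0
    return sum(1 for r in ranks if r <= k) / len(ranks)


def weighted_hit_at_k(ranks_by_pair: Dict[str, List[int]], k: int) -> float:
    """Mean of per-pair Hit@k, weighted by the number of queries in each pair."""
    total = sum(len(ranks) for ranks in ranks_by_pair.values())
    if total == 0:
        return 0.0
    return sum(hit_at_k(r, k) * len(r) for r in ranks_by_pair.values()) / total


# Placeholder directed architecture pairs with the rank of the true match per query.
ranks_by_pair = {
    "archA->archB": [1, 3, 12, 1],
    "archB->archA": [2, 50],
}
hit1 = weighted_hit_at_k(ranks_by_pair, k=1)
hit10 = weighted_hit_at_k(ranks_by_pair, k=10)
```

With query-count weighting, the result equals the pooled Hit@k over all queries, so pairs with more functions have more influence on the headline number.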

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same feature combination could be applied to other widely reused IoT libraries once similar historical release data are collected.
  • Integrating the ranked outputs directly into automated patch scanners might further reduce the time between firmware release and vulnerability detection.
  • Extending the geometric priors to account for vendor-specific linking order could improve accuracy on firmware that deviates from standard BusyBox build patterns.
  • The benchmark construction process itself offers a reusable template for creating evaluation sets for other stripped binary retrieval tasks.

Load-bearing premise

The chosen anonymous features and geometric priors remain sufficiently distinctive for correct function matching even after vendors apply extra stripping, obfuscation, or compiler changes not seen in the benchmark binaries.

What would settle it

Running the system on a fresh collection of real IoT firmware images that use heavier obfuscation or newer compiler flags and observing a sharp drop below the reported Hit@1 and Hit@10 rates would indicate the features are not discriminative enough.

Figures

Figures reproduced from arXiv: 2604.19496 by Huixi Li, Yinhao Xiao, Yongluo Shen.

Figure 1: Overview of EvoPatch-IoT. The pipeline starts from heterogeneous IoT firmware ecosystems and long-term BusyBox source releases, compiles paired …
Figure 2: Benchmark overview. Our dataset jointly exposes architecture coverage, large-scale analysis-function labels, stripped-function recovery statistics, and …
Figure 3: Overall method ranking under the unified stripped-compatible protocol. EvoPatch-IoT leads both Hit@1 and Hit@10 across the 57-version evaluation.
Figure 4: Version-wise weighted Hit@10 on the 57-version benchmark. EvoPatch-IoT remains consistently above the strongest baselines from early to recent …
Figure 5: Architecture-pair retrieval analysis. Left: mean Hit@10 of EvoPatch-IoT across all versions. Right: absolute Hit@10 gain over the strong ShapeStat …
Figure 6: Patch-state case study for CVE-2021-42386. Left: binary-level patch-state proxy across held-out architectures. Right: cross-architecture relative size …
Figure 7: Operational analysis of EvoPatch-IoT. Left: architecture-wise extraction time and anonymous match ratio. Right: average Hit@10 when each architecture …

Original abstract

BusyBox is one of the most widely reused userland components in Linux-based Internet-of-Things (IoT) firmware, yet its security assessment remains difficult because firmware images are frequently stripped, vendor patch practices are inconsistent, and the same source component is compiled for heterogeneous architectures. We propose EvoPatch-IoT, an evolution-aware cross-architecture retrieval framework for stripped BusyBox firmware binaries. EvoPatch-IoT combines anonymous instruction/context features, graph-level statistics, per-binary geometric priors, and historical function prototypes to localize homologous and potentially vulnerable functions without relying on symbols, source paths, or version strings at test time. We further construct a large-scale BusyBox benchmark from 57 historical versions, 270 unstripped binaries, 285 stripped binaries, and 130 source releases, yielding 1,550,752 function-symbol rows, 1,290,369 analysis-function rows, and 155,845 high-confidence stripped-to-unstripped matches. On 57 fully covered versions and 1,020 directed architecture pairs, EvoPatch-IoT achieves a weighted Hit@1 of 34.56% and Hit@10 of 56.24%, outperforming the strongest baseline by 16.04% and 26.85%, respectively, and reducing the expected manual inspection space by 98.98%. The method is best on 56 of 57 versions and maintains consistent advantages on difficult architecture pairs. In addition, a version-change transfer study reaches a mean ROC-AUC of 0.9887, and a CVE-2021-42386 patch-state proxy obtains 82.44% mean accuracy and 88.47% mean F1 across held-out architectures. These results show that evolution-aware binary retrieval is a practical foundation for scalable IoT firmware vulnerability auditing.
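
The abstract does not define "expected manual inspection space". One plausible reading, assumed here purely for illustration, is the average fraction of the candidate pool an analyst walks through before reaching the true match when following the ranking. The ranks and pool sizes below are invented.

```python
from typing import Sequence


def inspection_space_reduction(ranks: Sequence[int], pool_sizes: Sequence[int]) -> float:
    """1 - mean(rank / pool size): the share of candidates an analyst can skip.

    Assumed definition; the paper may compute this quantity differently.
    """
    if len(ranks) != len(pool_sizes) or not ranks:
        raise ValueError("ranks and pool_sizes must be non-empty and equally long")
    fractions = [rank / size for rank, size in zip(ranks, pool_sizes)]
    return 1.0 - sum(fractions) / len(fractions)


# Three hypothetical queries: 1-based rank of the true match and candidate pool size.
reduction = inspection_space_reduction(ranks=[1, 5, 20], pool_sizes=[1000, 1000, 2000])
print(f"{reduction:.4%}")
```

Under this reading, a high reduction mostly reflects the size of the candidate pool, so it should be read alongside Hit@k instead of replacing it.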

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EvoPatch-IoT, an evolution-aware cross-architecture retrieval framework for localizing homologous (and potentially vulnerable) functions in stripped BusyBox IoT firmware binaries. It combines anonymous instruction/context features, graph-level statistics, per-binary geometric priors, and historical function prototypes. The authors construct a benchmark from 57 historical BusyBox versions (270 unstripped + 285 stripped binaries, 130 source releases) and evaluate retrieval on 57 fully covered versions and 1,020 directed architecture pairs, reporting weighted Hit@1 of 34.56% and Hit@10 of 56.24% (outperforming the strongest baseline by 16.04% and 26.85%), plus a 98.98% reduction in expected manual inspection space. Additional results include a version-change transfer study (mean ROC-AUC 0.9887) and a CVE-2021-42386 patch-state proxy (82.44% mean accuracy, 88.47% mean F1).

Significance. If the reported retrieval performance generalizes, the work provides a practical foundation for scalable vulnerability auditing of BusyBox-based IoT firmware by substantially shrinking the manual inspection space. The large-scale, multi-version, multi-architecture benchmark (1,550,752 function-symbol rows and 155,845 high-confidence matches) is a clear strength and could serve as a community resource. The evolution-aware aspect and consistent advantages on difficult architecture pairs are notable.

major comments (1)
  1. [Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the 285 stripped binaries are obtained via symbol stripping on otherwise standard builds from 57 versions. No experiments or analysis address vendor-specific compiler flags (-Os/-O3), different GCC versions, architecture-specific tuning, or light obfuscation that commonly alter CFG structure, instruction sequences, and geometric embeddings in real IoT firmware. This directly affects the load-bearing assumption that the anonymous features, graph statistics, and geometric priors remain sufficiently discriminative, as degradation below the reported Hit@10 of 56.24% would undermine both the retrieval metrics and the claimed 98.98% inspection-space reduction.
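
A robustness ablation of the kind this comment asks for could be organized as a build matrix. The sketch below only enumerates configurations and does not build anything. The optimization levels come from the comment; the version, architecture, and compiler labels are placeholders, not values from the paper.

```python
from itertools import product
from typing import Dict, List


def build_matrix(
    versions: List[str], archs: List[str], compilers: List[str], opt_levels: List[str]
) -> List[Dict[str, str]]:
    """Enumerate every (version, architecture, compiler, optimization) combination."""
    return [
        {"version": v, "arch": a, "compiler": c, "cflags": o}
        for v, a, c, o in product(versions, archs, compilers, opt_levels)
    ]


matrix = build_matrix(
    versions=["busybox-vA", "busybox-vB"],    # placeholder release labels
    archs=["arch-1", "arch-2", "arch-3"],     # placeholder target architectures
    compilers=["gcc-old", "gcc-new"],         # placeholder toolchain versions
    opt_levels=["-Os", "-O3"],                # flags named in the referee comment
)
print(len(matrix))  # 2 * 3 * 2 * 2 = 24 configurations
```

Re-running retrieval with queries from one configuration and candidates from another would show how much Hit@k drops when the build settings differ.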
minor comments (2)
  1. [Abstract] The abstract states performance numbers and benchmark sizes but omits any description of the similarity metric, feature extraction procedure, training details, or statistical significance testing; this reduces verifiability of the central retrieval claim from the summary alone.
  2. [Version-change transfer study] The version-change transfer study (ROC-AUC 0.9887) is limited to temporal evolution within the same build configuration and does not test cross-optimization robustness, which should be explicitly noted as a limitation when claiming practical applicability.
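
For readers less familiar with the ROC-AUC figure discussed here, the following sketch computes it directly from its pairwise definition: the probability that a randomly chosen positive (for example, a function that changed between versions) scores higher than a randomly chosen negative, with ties counted as one half. The labels and scores are made up.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical change detector output: 1 = function changed, 0 = unchanged.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; a sort-based rank formulation gives the same value more efficiently.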

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance, the benchmark scale, and the potential practical value for IoT firmware auditing. We address the single major comment point by point below.

  1. Referee: [Evaluation setup and benchmark construction] Benchmark construction (described in the evaluation setup): the 285 stripped binaries are obtained via symbol stripping on otherwise standard builds from 57 versions. No experiments or analysis address vendor-specific compiler flags (-Os/-O3), different GCC versions, architecture-specific tuning, or light obfuscation that commonly alter CFG structure, instruction sequences, and geometric embeddings in real IoT firmware. This directly affects the load-bearing assumption that the anonymous features, graph statistics, and geometric priors remain sufficiently discriminative, as degradation below the reported Hit@10 of 56.24% would undermine both the retrieval metrics and the claimed 98.98% inspection-space reduction.

    Authors: We agree that this is a substantive limitation. The benchmark constructs the 285 stripped binaries solely by stripping symbols from standard builds of the 57 BusyBox versions; it does not include vendor-specific compiler flags, alternate GCC versions, architecture tuning, or any form of obfuscation. These real-world factors can alter CFG structure, instruction sequences, and geometric embeddings, potentially reducing the discriminative power of the anonymous instruction/context features, graph-level statistics, and per-binary geometric priors. While the evolution-aware historical function prototypes and anonymous embeddings are intended to confer robustness to cross-version and cross-architecture differences, we performed no empirical analysis of their behavior under optimization or obfuscation. Consequently, the reported weighted Hit@1 (34.56%), Hit@10 (56.24%), and 98.98% inspection-space reduction apply strictly to the controlled stripped setting and may not fully generalize. In the revised manuscript we will (1) expand the Evaluation Setup section to explicitly describe the benchmark construction choices, (2) add a dedicated Limitations subsection that states the above assumption and its implications for the claimed metrics, and (3) note that additional adaptation may be required for heavily optimized or obfuscated firmware. This revision will qualify the claims without altering the core experimental results or benchmark description.

    revision: partial

Circularity Check

0 steps flagged

No circularity in empirical retrieval framework

The paper describes an empirical binary retrieval system that extracts anonymous instruction/context features, graph statistics, geometric priors, and historical prototypes to match functions across architectures and versions. All reported metrics (Hit@1, Hit@10, ROC-AUC, CVE proxy accuracy) are computed on a separately constructed benchmark of 57 versions and 1,020 architecture pairs, with explicit held-out elements such as the version-change transfer study and CVE patch-state evaluation. No equations, self-citations, or fitted parameters are presented as deriving the final performance numbers; the results are direct empirical measurements on the benchmark rather than quantities forced by construction from the same inputs.
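
The held-out evaluation mentioned above can be pictured as a leave-one-architecture-out loop: a classifier is fit on all architectures but one, scored on the remaining one, and the per-fold metrics are averaged. The sketch below covers only the scoring and averaging step. The fold names, labels, and predictions are invented, and whether the paper averages in exactly this way is an assumption.

```python
from typing import Dict, List, Tuple


def accuracy_f1(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Accuracy and F1 for binary labels (1 = patched, 0 = unpatched); inputs non-empty."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


def mean_over_held_out(folds: Dict[str, Tuple[List[int], List[int]]]) -> Tuple[float, float]:
    """Average accuracy and F1 over folds, one fold per held-out architecture."""
    per_fold = [accuracy_f1(y_true, y_pred) for y_true, y_pred in folds.values()]
    n = len(per_fold)
    return sum(a for a, _ in per_fold) / n, sum(f for _, f in per_fold) / n


# Placeholder folds: (true patch state, predicted patch state) per held-out architecture.
folds = {
    "arch-1": ([1, 1, 0, 1], [1, 1, 1, 1]),
    "arch-2": ([1, 0, 0, 1], [1, 0, 0, 0]),
}
mean_acc, mean_f1 = mean_over_held_out(folds)
```

In the first placeholder fold the positives dominate and F1 exceeds accuracy, which is one way a mean F1 can end up above the mean accuracy.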

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters, axioms, or invented entities; the method appears to rely on standard binary-analysis assumptions (feature stability under stripping) and empirical matching rather than new postulated entities.


