
arxiv: 2606.06570 · v1 · pith:CRR663QPnew · submitted 2026-06-04 · 💻 cs.CR · cs.AI

MalTree: Tracing Malware Evolution from Embeddings at Scale

Pith reviewed 2026-06-28 00:17 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords malware evolution · phylogenetic analysis · embeddings · temporal consistency · malware families · UPGMA · Neighbor-Joining

The pith

MalTree applies phylogenetic techniques to malware embeddings to recover evolutionary relationships that match real timelines in 87 percent of cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that bioinformatics-inspired phylogenetic methods can be applied at scale to embeddings of malware features to automatically model evolutionary relationships among families. A sympathetic reader would care because this could shift malware defense from reacting to individual samples after they appear to anticipating lineages based on how families evolve over time. The framework uses structural, behavioral, and image-based features and validates the trees against actual emergence timestamps from VirusTotal.

Core claim

MalTree converts malware samples into embeddings using multiple feature types and applies UPGMA and Neighbor-Joining algorithms to construct phylogenetic trees. When tested, these trees show 87 percent temporal consistency with real-world timelines, and the analysis reveals that mutation rates vary by more than a factor of ten across families. Inferred relationships for cases like the Mirai botnet match documented threat intelligence.

What carries the argument

The MalTree framework, which extracts embeddings from structural, behavioral, and image-based features and builds phylogenetic trees using clustering algorithms to infer evolutionary order.
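To make the tree-building step concrete, here is a minimal UPGMA sketch in plain Python. It is an illustration only, not the authors' implementation: the Euclidean distance, the toy 2-D "embeddings", and the function names `upgma` and `leaves` are all assumptions introduced here, and the paper's Neighbor-Joining variant is not shown.

```python
import math
from itertools import combinations

def upgma(points):
    """Build a UPGMA tree over points (tuples of floats).

    Leaves are integer indices; internal nodes are (left, right, height).
    Euclidean distance is an assumption made for this sketch.
    """
    clusters = {i: (i, 1) for i in range(len(points))}  # id -> (subtree, size)
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}
    next_id = len(points)
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances (the UPGMA rule).
        merged = {}
        for c in clusters:
            d_ac = dist[tuple(sorted((a, c)))]
            d_bc = dist[tuple(sorted((b, c)))]
            merged[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), n_a + n_b)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree

def leaves(tree):
    """Return the set of leaf indices under a (sub)tree."""
    if isinstance(tree, int):
        return {tree}
    left, right, _height = tree
    return leaves(left) | leaves(right)

# Two made-up "families" of three samples each, far apart in embedding space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
tree = upgma(points)
left, right, height = tree
print(sorted(leaves(left)), sorted(leaves(right)))
```

On this toy input the top-level split separates the two synthetic families. UPGMA assumes a uniform rate of change along all branches, which is why the paper also uses Neighbor-Joining when rates differ across families.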

If this is right

  • Malware families can be grouped by their evolutionary tempo, allowing detection to be adjusted for fast-mutating ones.
  • Inferred trees provide a way to anticipate new variants based on lineage patterns.
  • Malware analysis can transition from classifying individual samples to modeling family lineages.
  • Phylogenetic validation using timestamps offers a scalable alternative to manual reverse engineering for lineage discovery.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding-to-tree approach might apply to tracking evolution in other domains like software vulnerabilities or network protocols.
  • If the method scales, it could integrate with existing ML detectors to flag samples likely to be related to known fast-evolving families.
  • Future work could test whether adding more feature types improves consistency beyond 87 percent.

Load-bearing premise

The embeddings built from structural, behavioral, and image-based features contain enough evolutionary signal that phylogenetic algorithms can reconstruct the actual order in which malware families emerged.

What would settle it

Finding a set of malware families where the phylogenetic tree topology built from embeddings places many samples in an order that contradicts their known first-seen timestamps from a reliable source.

Figures

Figures reproduced from arXiv: 2606.06570 by Akash Amalan, Georgios Smaragdakis, Tom J. Viering.

Figure 1
Figure 1: Simplified version of a phylogenetic tree that illustrates malware evolutionary relationships.
Figure 2
Figure 2: A rooted phylogenetic tree with four taxa. Leaves (shaded) correspond to observed samples; internal nodes represent inferred ancestors. Path Distance. For any two nodes $u, v \in V$, let $\mathrm{path}(u, v) \subseteq E$ denote the unique path between them. The path distance $d_T(u, v) = \sum_{e \in \mathrm{path}(u,v)} w(e)$ defines a metric on $V$; when both nodes are leaves, this is the patristic distance. Most Recent Common Ancestor (MRCA). In …
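The path distance and MRCA definitions in the Figure 2 excerpt can be sketched as follows. The tree encoding (a `parent` map from child to `(parent, edge_weight)`), the node names, and the edge weights are all hypothetical choices made for this example, not taken from the paper.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node][0]
        path.append(node)
    return path

def mrca(u, v, parent):
    """Most recent common ancestor: first ancestor of v that is also above u."""
    ancestors_u = set(path_to_root(u, parent))
    for node in path_to_root(v, parent):
        if node in ancestors_u:
            return node
    raise ValueError("nodes are not in the same tree")

def dist_to_ancestor(node, ancestor, parent):
    """Sum of edge weights w(e) on the path from node up to ancestor."""
    total = 0.0
    while node != ancestor:
        node, weight = parent[node]
        total += weight
    return total

def path_distance(u, v, parent):
    """d_T(u, v): sum of w(e) over the unique path between u and v."""
    a = mrca(u, v, parent)
    return dist_to_ancestor(u, a, parent) + dist_to_ancestor(v, a, parent)

# A rooted tree with four taxa (A, B, C, D); weights are invented.
parent = {
    "x": ("r", 1.0), "y": ("r", 2.0),
    "A": ("x", 3.0), "B": ("x", 4.0),
    "C": ("y", 1.0), "D": ("y", 5.0),
}
print(mrca("A", "B", parent), path_distance("A", "B", parent))
print(mrca("A", "D", parent), path_distance("A", "D", parent))
```

Because both arguments in the example calls are leaves, the values printed are patristic distances in the sense used in the caption.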
Figure 3
Figure 3: MalTree pipeline. Left: Multi-modal embedding extraction produces pseudo-static ($e_s$), dynamic ($e_d$), and image ($e_i$) representations. These are concatenated and reduced, from which pairwise distances yield matrix $D$. Right: Tree construction via Neighbor-Joining reveals family-level structure, with clades corresponding to functional categories. Pseudo-static embedding ($e_s \in \mathbb{R}^{3512}$). From each memory dump, we…
Figure 4
Figure 4: Phylogenetic tree for temporal analysis. Leaves show first-submission year. If $L_i < L_j$ for samples sharing an MRCA, then $t(s_i) < t(s_j)$ should hold. (see Appendix J for details). If mutation rates were uniform, min/max drift values would be consistent across families. High variance indicates non-uniform evolution, favoring NJ over UPGMA. Inter-family evolutionary inference. Beyond validation, we use trees …
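The temporal check in the Figure 4 caption (a shorter branch to the shared ancestor should mean an earlier first-seen time) can be sketched as a pairwise score. The function name `temporal_consistency` and the sample values are assumptions for illustration. Ties in branch length or timestamp count as inconsistent here, which is a literal reading of the strict inequalities and may differ from how the paper handles them.

```python
from itertools import combinations

def temporal_consistency(siblings):
    """Fraction of sibling pairs whose branch-length order matches time order.

    siblings: list of (branch_length_to_shared_ancestor, first_seen) tuples
    for samples descending from the same ancestor.
    """
    consistent = 0
    total = 0
    for (l_i, t_i), (l_j, t_j) in combinations(siblings, 2):
        total += 1
        # Consistent when the shorter branch also has the earlier timestamp.
        if (l_i < l_j and t_i < t_j) or (l_i > l_j and t_i > t_j):
            consistent += 1
    return consistent / total if total else float("nan")

# Invented example: branch lengths with first-submission years.
siblings = [(2.0, 2016), (5.0, 2018), (3.0, 2019)]
print(temporal_consistency(siblings))
```

In this toy case two of the three pairs agree; the pair with lengths 5.0 and 3.0 disagrees because the longer branch carries the earlier year.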
Figure 5
Figure 5: Embedding drift (distance/year, log scale) varies substantially across families. … labels used during embedding extraction, reinforcing that embeddings capture genuine evolutionary structure rather than merely reconstructing the training taxonomy. The reported 87.1% is measured on the full tree before any filtering. Removing 5,385 intra-family outliers (of 103,883) raises consistency only to 88.5%, a +1.4 …
Figure 6
Figure 6: Mirai inter-family subgraph with edge weights. Red: validated by threat intelligence; gray: lacking corroborating evidence. Lower weights indicate stronger phylogenetic support. We corroborate these relationships with static features that are independent of the embeddings used to build the tree, computing Jaccard similarity over combined import/export symbol sets extracted with LIEF (Appendix M)
Figure 7
Figure 7: Validation process for ensuring sample quality and accurate family labels using VirusTotal consensus. Flowchart nodes: obtain SHA from collection; is the SHA in VirusTotal?; does the family label match the most popular VirusTotal label?; does the family label agree with 30% of VirusTotal AV labels?; outcomes are "Dismiss Sample" or "Accept Sample".
Figure 8
Figure 8: Lateral distance calculation for a simple case. In family WpBruteBot, samples W1 and W2 lie at distances $L_1 = 8$ and $L_2 = 6$ from their MRCA, so the lateral distance is $d_{\mathrm{lateral}}(W_1, W_2) = L_1 + L_2 = 8 + 6 = 14$.
Figure 9
Figure 9: Median lateral distance calculation. Sample W4 has substantially higher median lateral distance compared to others, indicating it is an outlier.
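The outlier idea in Figures 8 and 9 can be sketched as below. Only the W1/W2 distance of 14 comes from the Figure 8 example; every other distance is made up, and the flagging rule (a sample whose median lateral distance exceeds `factor` times the family's typical median) is a hypothetical threshold, since the excerpt does not state the paper's actual criterion.

```python
from statistics import median

def median_lateral(dist, samples):
    """Median lateral distance from each sample to the other family members."""
    return {
        s: median(dist[frozenset((s, other))] for other in samples if other != s)
        for s in samples
    }

def flag_outliers(med, factor=2.0):
    """Flag samples whose median exceeds factor x the family-wide median.

    The factor-based rule is an assumption made for this sketch.
    """
    typical = median(med.values())
    return sorted(s for s, m in med.items() if m > factor * typical)

samples = ["W1", "W2", "W3", "W4"]
dist = {
    frozenset(("W1", "W2")): 14,  # L1 + L2 = 8 + 6, as in Figure 8
    frozenset(("W1", "W3")): 12,  # remaining distances are invented
    frozenset(("W2", "W3")): 10,
    frozenset(("W1", "W4")): 40,
    frozenset(("W2", "W4")): 38,
    frozenset(("W3", "W4")): 42,
}
med = median_lateral(dist, samples)
print(med, flag_outliers(med))
```

Using the median rather than the mean keeps a single distant sample from inflating the scores of its well-behaved siblings.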
Figure 10
Figure 10: Impact of outliers on tree topology. Left: With outlier C4 present, its extreme divergence from other Family C members (C1, C2, C3) causes Family C to be placed far from Family A during tree construction. Families A and C share MRCA at root r, inflating distances. Right: After removing C4 and rebuilding, the remaining Family C samples (C1, C2, C3) cluster with Family A, sharing MRCA at intermediate node T…
Figure 11
Figure 11: Path-length intuition. Sample A1 with shorter branch has fewer modifications from ancestor, suggesting earlier emergence. Sample A2 with longer branch has more modifications, suggesting later emergence. G.2. Global Clock vs. Local Clock. A critical distinction: our path-length assumption does not require a global molecular clock (Bromham & Penny, 2003). It assumes only local rate homogeneity between the im…
Figure 12
Figure 12: Variable mutation rates across families. The fast-evolving and slow-evolving families operate at different absolute rates, yet within each family the relative branch lengths still recover divergence order. Our ordering requires only this within-pair (local) consistency, not equal rates across families. Key insight: We compare siblings within the same family, where both variants were likely produced by the…
Figure 13
Figure 13: Edge weight calculation example. Family $F_A$ has median distance 15 to the shared MRCA; Family $F_B$ has median distance 25. The resulting edge points from $F_A$ to $F_B$ with weight 15. • Our approach: Retains the minimum-weight outgoing edge from each node, preserving edge directionality (from earlier-diverging to later-diverging families). Each family points to its single most likely progenitor based on phylogene…
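A small sketch of the edge construction described in the Figure 13 excerpt follows. It assumes the edge runs from the family with the smaller median depth to the one with the larger, and that the weight is the source family's median; the example (15 versus 25, weight 15) fits this reading but does not confirm it. The helper names and the per-sample depths are hypothetical.

```python
from statistics import median

def family_edge(name_a, depths_a, name_b, depths_b):
    """Directed edge (source, target, weight) between two families.

    depths_*: path distances from the shared MRCA to each sample in a family.
    Assumption: the source is the family with the smaller median depth and
    the weight is that smaller median.
    """
    med_a, med_b = median(depths_a), median(depths_b)
    if med_a <= med_b:
        return (name_a, name_b, med_a)
    return (name_b, name_a, med_b)

def keep_min_outgoing(edges):
    """Retain only the minimum-weight outgoing edge for each source node."""
    best = {}
    for src, dst, weight in edges:
        if src not in best or weight < best[src][2]:
            best[src] = (src, dst, weight)
    return sorted(best.values())

# Invented depths chosen so the medians are 15 and 25, as in Figure 13.
edge = family_edge("FA", [12, 15, 19], "FB", [20, 25, 31])
print(edge)

# Invented candidate edges to show the pruning step.
pruned = keep_min_outgoing([("FA", "FB", 15), ("FA", "FC", 9), ("FB", "FC", 20)])
print(pruned)
```

Keeping one outgoing edge per node is a local, greedy choice; as the excerpt notes, it does not optimize any global objective over the graph.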
Figure 14
Figure 14: SmokeLoader inter-family subgraph with edge weights. Green edges denote validated delivery chain associations; gray dashed edges indicate connections to APT families.
Figure 15
Figure 15: DiscordTokenStealers hub pattern showing 10 representative connections (of 22 total); edge weights in the figure range from w = 4.5 to w = 14.1. The DiscordTokenStealers family exhibits a hub pattern with 22 inferred connections. This family comprises credential-harvesting tools targeting Discord authentication tokens, implementing file system traversal and HTTP exfiltration
Figure 16
Figure 16: Conti/TrickBot ecosystem fragmentation. Red nodes represent the documented Wizard Spider threat actor’s toolset, which should cluster together but instead scatter across unrelated hub families. Dashed box indicates the expected phylogenetic grouping. The Conti ransomware ecosystem represents a well-documented threat actor cluster that MalTree fails to recover. The Wizard Spider group operated TrickBot (ba…
Figure 17
Figure 17: Image embedding pipeline. Model V1 (ResNet-50 with ImageNet weights) is trained on public malware image datasets, then fine-tuned as Model V2 on our dataset. The 2048-dimensional penultimate layer activation serves as $e_i$.
Figure 18
Figure 18: Fusion architecture. Multi-modal embeddings are concatenated, normalized, and passed through a two-layer network to produce a unified 1000-dimensional representation. Diagram flow: $e_s$ (3512-d), $e_d$ (1000-d), and $e_i$ (2048-d) → Concatenate (6560-d) → L2 Normalize → Linear (6560 → 1000) + ReLU, from which the representation $e$ (1000-d) is extracted → Linear (1000 → 538), discarded after training.
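The data flow in Figure 18 can be sketched as a forward pass in plain Python. This is a shape-level illustration with toy dimensions and random weights, not the trained model: the helper names are hypothetical, and taking the representation after the ReLU (rather than before it) is an assumption the caption does not settle.

```python
import math
import random

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def linear(weights, bias, x):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, x) for x in v]

def init_layer(n_out, n_in, rng):
    """Random weights and zero bias; stands in for trained parameters."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def fuse(e_s, e_d, e_i, layer1, head=None):
    """Concatenate -> L2 normalize -> Linear + ReLU -> optional head."""
    x = l2_normalize(e_s + e_d + e_i)
    representation = relu(linear(*layer1, x))  # assumed extraction point
    logits = linear(*head, representation) if head else None  # training only
    return representation, logits

# Paper dimensions: 3512 + 1000 + 2048 = 6560 -> 1000 -> 538.
# Toy dimensions below keep the example fast: 7 + 2 + 4 = 13 -> 5 -> 3.
rng = random.Random(0)
e_s = [rng.random() for _ in range(7)]
e_d = [rng.random() for _ in range(2)]
e_i = [rng.random() for _ in range(4)]
layer1 = init_layer(5, 13, rng)
head = init_layer(3, 5, rng)
representation, logits = fuse(e_s, e_d, e_i, layer1, head)
print(len(representation), len(logits))
```

After training, calling `fuse` without `head` mirrors the caption's note that the final layer is discarded and only the intermediate representation is kept.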
Figure 19
Figure 19: Cross-modality drift agreement. Axes are normalized drift rates (cross-year drift scaled by within-year spread) for two modalities. Families that drift rapidly in one modality do so in the others, with Spearman ρ = 0.90 between image and pseudo-static and ρ = 0.75 between pseudo-static and dynamic across 272 families.

Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse engineering can take months to years to uncover such lineage relationships. We propose MalTree, a framework that applies bioinformatics-inspired phylogenetic techniques (UPGMA and Neighbor-Joining) at scale to model malware evolution automatically using structural, behavioral, and image-based features. We introduce temporal validation using VirusTotal timestamps to assess whether inferred trees reflect actual evolutionary order. MalTree achieves 87% temporal consistency, indicating that inferred evolutionary relationships closely align with real-world emergence timelines. Our analysis shows that some families mutate over 10 times faster than others, suggesting that detection strategies should be tailored to family-specific evolutionary tempos. Case studies, including the Mirai botnet, confirm that inferred relationships from our phylogenetic tree align with documented threat intelligence. Our framework provides a foundation for shifting malware analysis from sample-by-sample classification toward lineage-aware evolutionary modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MalTree, a framework that extracts structural, behavioral, and image-based features from malware samples into embeddings and applies phylogenetic tree-construction methods (UPGMA and Neighbor-Joining) to infer evolutionary relationships at scale. It reports 87% temporal consistency between the inferred trees and VirusTotal first-seen timestamps, claims that some families mutate more than 10 times faster than others, and presents case studies (including Mirai) said to align with documented threat intelligence.

Significance. If the temporal validation is robust, the work would supply a scalable, automated approach to lineage-aware malware modeling that could support family-specific detection tuning. The application of established phylogenetic algorithms to large embedding spaces is a clear methodological contribution, though its value hinges on whether the recovered order reflects genuine evolutionary precedence rather than artifacts of the chosen features or validation proxy.

major comments (2)
  1. [Abstract] Abstract (temporal validation paragraph): the 87% consistency figure is obtained by comparing inferred tree order against VirusTotal first-seen timestamps, yet the manuscript supplies no quantitative controls for the well-documented delays and submission biases in those timestamps (samples often circulate weeks or months before upload, and upload rates vary by family and region). Without such controls or an alternative ground-truth comparison, the reported alignment does not establish that the phylogenetic ordering recovers actual lineage precedence.
  2. [Abstract] Abstract (feature and method description): the claim that structural, behavioral, and image-based embeddings contain sufficient evolutionary signal for phylogenetic recovery is asserted without an ablation or control experiment that isolates recency-correlated signals (e.g., packing artifacts or behavioral telemetry that may correlate with upload date) from true mutational order. This leaves open the possibility that the distance metric already encodes temporal information, rendering the 87% figure partly tautological.
minor comments (1)
  1. [Abstract] The abstract states that mutation rates differ by more than 10× but does not define how branch lengths or substitution rates are computed from the trees or normalized across families; a brief methods paragraph or equation would clarify this metric.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the validation of MalTree. We address each major comment below and outline planned revisions.

  1. Referee: [Abstract] Abstract (temporal validation paragraph): the 87% consistency figure is obtained by comparing inferred tree order against VirusTotal first-seen timestamps, yet the manuscript supplies no quantitative controls for the well-documented delays and submission biases in those timestamps (samples often circulate weeks or months before upload, and upload rates vary by family and region). Without such controls or an alternative ground-truth comparison, the reported alignment does not establish that the phylogenetic ordering recovers actual lineage precedence.

    Authors: We agree that VirusTotal timestamps are an imperfect proxy subject to delays and biases. In the revised manuscript we will add a dedicated limitations subsection that quantifies these effects (e.g., timestamp discrepancy distributions across families) and reports sensitivity analyses on high-confidence subsets. No large-scale alternative ground truth for malware lineage order is publicly available, which motivated our choice of proxy; we will make this caveat explicit while retaining the 87% figure as an indicative rather than definitive result. revision: partial

  2. Referee: [Abstract] Abstract (feature and method description): the claim that structural, behavioral, and image-based embeddings contain sufficient evolutionary signal for phylogenetic recovery is asserted without an ablation or control experiment that isolates recency-correlated signals (e.g., packing artifacts or behavioral telemetry that may correlate with upload date) from true mutational order. This leaves open the possibility that the distance metric already encodes temporal information, rendering the 87% figure partly tautological.

    Authors: The referee correctly identifies the absence of ablations that separate mutational signal from recency-correlated artifacts. We will add these experiments in revision: (1) per-embedding-type trees with separate temporal-consistency scores, (2) controls that shuffle timestamps or mask known recency features (packing, telemetry volume), and (3) comparison against distance metrics that explicitly remove date-correlated dimensions. These additions will clarify whether the recovered order reflects evolutionary precedence beyond feature recency. revision: yes
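One of the proposed controls, shuffling timestamps, can be sketched as a permutation baseline. This is a simplified illustration rather than the authors' planned experiment: it treats all samples as siblings under a single ancestor, and the function names, branch lengths, and years are invented.

```python
import random
from itertools import combinations

def consistency(lengths, times):
    """Fraction of pairs where branch-length order agrees with time order."""
    pairs = list(combinations(range(len(lengths)), 2))
    agree = sum(
        (lengths[i] - lengths[j]) * (times[i] - times[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)

def shuffled_baseline(lengths, times, trials=1000, seed=0):
    """Mean consistency after randomly reassigning timestamps to samples."""
    rng = random.Random(seed)
    shuffled = list(times)
    scores = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        scores.append(consistency(lengths, shuffled))
    return sum(scores) / trials

# Invented data in which branch length tracks first-seen year perfectly.
lengths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
times = [2015, 2016, 2017, 2018, 2019, 2020]
observed = consistency(lengths, times)
baseline = shuffled_baseline(lengths, times)
print(observed, round(baseline, 2))
```

With distinct values, shuffled timestamps give a baseline near 0.5, so an observed score is informative only to the extent that it exceeds that chance level. A shuffle alone cannot rule out the referee's concern that the features themselves encode recency; that requires the feature-masking controls also listed in the response.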

Circularity Check

0 steps flagged

No circularity: external timestamps provide independent validation


The derivation builds embeddings from structural/behavioral/image features, applies standard phylogenetic algorithms (UPGMA, Neighbor-Joining), and scores the resulting trees against VirusTotal first-seen timestamps to obtain the 87% temporal consistency figure. Because the timestamps are an external, non-derived data source unrelated to the embedding construction or tree inference steps, the reported consistency does not reduce to a fitted parameter, self-definition, or self-citation chain. The central claim therefore remains self-contained against an independent benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is limited to the abstract, which invokes standard phylogenetic algorithms and the assumption that chosen features capture evolutionary signal but lists no explicit free parameters or new entities.

axioms (1)
  • domain assumption UPGMA and Neighbor-Joining algorithms applied to feature embeddings recover evolutionary order when the features contain phylogenetic signal
    Invoked by the choice to apply these methods to malware data in the abstract.

pith-pipeline@v0.9.1-grok · 5703 in / 1155 out tokens · 26687 ms · 2026-06-28T00:17:24.517746+00:00 · methodology


Reference graph

Works this paper leans on

18 extracted references · 9 canonical work pages

  1. [1]

    In: Mulder, V., Mermoud, A., Lenders, V., Tellenbach, B

    URL https://any.run/. Accessed: 2025-01-27. Badhwar, R. Polymorphic and metamorphic malware. In The CISO’s Next Frontier: AI, Post-Quantum Cryptography and Advanced Security Paradigms, pp. 279–285. Springer International Publishing, 2021. doi: 10.1007/978-3-030-75354-2_35. Bozkir, A., Cankaya, A., and Aydos, M. Utilization and comparision of convolut...

  2. [2]

    Accessed: 2025-01-26

    URL https://www.cisa.gov/news-events/cybersecurity-advisories/aa21-265a. Accessed: 2025-01-26. CYFIRMA Research. Vidar stealer: An in-depth analysis of an information-stealing malware. CYFIRMA Threat Research, 2024. URL https://www.cyfirma.com/research/vidar-stealer-an-in-depth-analysis/. Accessed: 2025-01-26. Day, W. and Edelsbrunner, H. Efficient al...

  3. [3]

    Felsenstein, J

    doi: 10.1086/282802. Felsenstein, J. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 39(4):783–791,

  4. [4]

    Felsenstein, J. Inferring Phylogenies

    doi: 10.1111/j.1558-5646.1985.tb00420.x. Felsenstein, J. Inferring Phylogenies. Sinauer Associates, Sunderland, MA, 2004. ISBN 0-87893-177-5. Freitas, S., Duggal, R., and Chau, D. H. Malnet: A large-scale image database of malicious software. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 3948–3952, 2...

  5. [5]

    Huntress Labs

    doi: 10.1093/molbev/msw046. Huntress Labs. SmokeLoader malware: Analysis, detection, removal. Huntress Threat Library, 2024. URL https://www.huntress.com/threat-library/malware/smokeloader. Accessed: 2025-01-26. Huson, D. H. and Scornavacca, C. A survey of combinatorial methods for phylogenetic networks. Genome Biology and Evolution, 3:23–35, 11 2010. ...

  6. [6]

    Accessed: 2025-01-26

    URL https://www.proofpoint.com/us/blog/threat-insight/major-botnets-disrupted-global-law-enforcement-takedown. Accessed: 2025-01-26. Quynh, N. A. Capstone: Next-gen disassembly framework. Black Hat USA, 2014. URL https://www.capstone-engine.org. Accessed: 2025-01-27. Raff, E., Barker, J., Sylvester, J., Brandon, R., Catanzaro, B., and Nicholas, C. Mal...

  7. [7]

    Accessed: 2025-01-26

    URL https://www.cyberark.com/resources/threat-research-blog/chatting-our-way-into-creating-a-polymorphic-malware. Accessed: 2025-01-26. Simonsen, M., Mailund, T., and Pedersen, C. N. S. Rapid neighbour-joining. In Algorithms in Bioinformatics, volume 5251, pp. 113–122. Springer Berlin Heidelberg,

  8. [8]

    Sokal, R

    doi: 10.1007/978-3-540-87361-7_10. Sokal, R. R. and Michener, C. D. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38:1409–1438, 1958. Suarez-Tangil, G., Tapiador, J. E., Peris-Lopez, P., and Blasco, J. Dendroid: A text mining approach to analyzing and classifying code structures in Android malware fami...

  9. [9]

    Sultanik, E

    doi: 10.1016/j.eswa.2013.07.106. Sultanik, E. bin2png: Converts arbitrary binary data to RGB image. GitHub Repository, 2020. URL https://github.com/ESultanik/bin2png. Accessed: 2025-01-26. The Hacker News. New Moobot botnet exploiting Hikvision camera vulnerability. The Hacker News, 2022. URL https://thehackernews.com/2022/12/moobot-botnet.html. Accesse...

  10. [10]

    doi: 10.1109/INNOVATIONS.2012.6207739

    IEEE, 2012. doi: 10.1109/INNOVATIONS.2012.6207739. VirusTotal. VirusTotal. Google Cloud, 2012. URL https://www.virustotal.com. Founded 2004 by Hispasec Sistemas; acquired by Google 2012. Accessed: 2025-01-26. VMRay. Malware analysis spotlight: Smoke Loader. VMRay Blog, 2024. URL https://...

  11. [11]

    Identify immediate parent node $a$

  12. [12]

    Find all sibling pairs $(s_i, s_j)$ descending from $a$

  13. [13]

    For each pair: • Compute $L_i = d_T(s_i, a)$ and $L_j = d_T(s_j, a)$ • Get timestamps $t(s_i)$ and $t(s_j)$ from VirusTotal • Check: $(L_i < L_j \wedge t(s_i) < t(s_j))$ or $(L_i > L_j \wedge t(s_i) > t(s_j))$

  14. [14]

    Temporal consistency = proportion of consistent pairs Our 87% temporal consistency confirms the assumption holds in practice. H. Inter-Family Analysis This section explains our methodology for inferring evolutionary relationships between malware families and constructing the inter-family graphs shown in case studies. H.1. Motivation When comparing differe...

  15. [15]

    Identify their shared MRCA: $r = \mathrm{MRCA}(F_A, F_B)$

  16. [16]

    Compute the path distance from $r$ to each leaf in both families

  17. [17]

    The median represents the “typical” evolutionary depth for each family

    Calculate the median distance for each family: $\tilde{d}_A = \operatorname{median}\{d_T(s, r) : s \in F_A\}$ (11) and $\tilde{d}_B = \operatorname{median}\{d_T(s, r) : s \in F_B\}$ (12). The median is chosen for robustness (Rousseeuw & Croux, 1993): each family contains many samples at varying distances, and the mean would be sensitive to extreme values. The median represents the “typical” evolutionary depth for each fami...

  18. [18]

    • Our approach: Retains the minimum-weight outgoing edge from each node, preserving edge directionality (from earlier-diverging to later-diverging families)

    The resulting edge points from $F_A$ to $F_B$ with weight 15. • Our approach: Retains the minimum-weight outgoing edge from each node, preserving edge directionality (from earlier-diverging to later-diverging families). Each family points to its single most likely progenitor based on phylogenetic distance, rather than optimizing a global objective. These approa...