pith. sign in

arxiv: 2607.01445 · v1 · pith:ESMGXRSXnew · submitted 2026-07-01 · 💻 cs.CR · cs.LG

Hamm-Grams: An Algorithm for Mining Regular Expressions of Bytes

Pith reviewed 2026-07-03 19:58 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords hamm-gramsmalware detectionn-gramslocality-sensitive hashingregular expressionsbyte sequencesclusteringmachine learning features
0
0 comments X

The pith

Hamm-grams mined via locality-sensitive hashing and clustering produce more robust byte features than n-grams for malware classification and detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes hamm-grams as fixed-length byte patterns containing a single wildcard character to serve as more robust features than traditional n-grams in malware machine learning tasks. These patterns are located by applying a locality-sensitive hash that collides on strings differing by a small Hamming distance, then clustering the colliding strings inside each bucket to decide where the wildcard should be placed. The resulting features tolerate minor byte variations that commonly appear in malware variants, reducing brittleness. A sympathetic reader would care because static n-gram features are known to break under small mutations that do not change the overall malicious behavior. The abstract and description indicate that the approach is demonstrated on malware classification and detection benchmarks.

Core claim

Hamm-grams are a special class of regular expressions consisting of fixed-length byte sequences with exactly one single-character wildcard. They are mined by first using a new locality-sensitive hash designed to produce collisions among pairs with small Hamming distance, followed by clustering within each hash bucket to place the wildcard at the position of greatest variation. The paper states that these features are more robust than n-grams and yield better results in malware classification and detection tasks.

What carries the argument

The hamm-gram mining algorithm that applies locality-sensitive hashing tuned for small Hamming distance followed by intra-bucket clustering to insert wildcards.

If this is right

  • Hamm-grams reduce the brittleness of static byte features to small mutations typical in malware variants.
  • The LSH-plus-clustering procedure mines useful patterns more efficiently than exhaustive search over all possible wildcard placements.
  • Regular-expression-style features with single wildcards capture structural similarities that exact n-grams miss.
  • The same mining pipeline can be applied to any large corpus of byte sequences where small variations are expected.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the method works, similar wildcard patterns could be tested on other noisy sequence domains such as network traffic or file-format parsing.
  • Performance will depend on how frequency and utility thresholds are chosen in practice, since the abstract does not specify explicit cutoffs.

Load-bearing premise

The locality-sensitive hash and subsequent clustering step will reliably surface hamm-grams that are both frequent enough to be useful and sufficiently discriminative for malware tasks.

What would settle it

A head-to-head experiment on the same malware datasets where standard n-gram features achieve equal or higher classification accuracy than hamm-grams would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2607.01445 by Derek Everett, Edward Raff, James Holt.

Figure 1
Figure 1. Figure 1: An illustration of how the complex hashing function tends to put similar sequences into the same hash bucket, illustrated [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (left) Approximate probability of two 𝑛-grams having a single-bit hash code collision given that they differ by Hamming distance 𝑘. (right) 95% credible interval for collision probability between pairs of (𝑛 = 100)-grams as a function of normalized Hamming distance 𝐻𝑑 /𝑛. assigned the same bit (there are two possible bits). In practice, we use 𝑀-bit hash codes, and the probability of an 𝑀-bit collision wil… view at source ↗
Figure 3
Figure 3. Figure 3: Balanced accuracy scores of Logistic Regression classifiers trained on the top- [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison between hamm-gram features and three additional hash baselines, ssdeep, sdhash, and LZJD. [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
read the original abstract

Malware poses a critical and ever-evolving threat, and robust and effective systems for detecting and classifying malware are of essential importance. $n$-grams features are among the common static features used in effective machine learning systems for malware, but these features are inherently brittle. We propose an algorithm for constructing more robust features, hamm-grams, which are a special class of regular expressions having a fixed length and single-character wildcards. We devise an efficient algorithm for finding common hamm-grams using a new locality-sensitive hash designed to produce collisions among pairs of small Hamming distance and a clustering within hash buckets to place wildcards. We then demonstrate the advantages of these features in malware classification and detection tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces hamm-grams, a class of fixed-length regular expressions over bytes that allow single-character wildcards. These are mined via a locality-sensitive hash (LSH) designed to collide on small Hamming distances, followed by intra-bucket clustering to determine wildcard positions. The central claim is that the resulting hamm-grams yield more robust features than conventional n-grams for malware classification and detection.

Significance. If the LSH-plus-clustering procedure can be shown to produce reproducible, discriminative hamm-grams that demonstrably improve robustness under realistic dataset shift, the work would supply a practical alternative to brittle n-gram features in static malware analysis pipelines.

major comments (2)
  1. [Mining algorithm description] Mining algorithm section: no concrete selection rule (support threshold, entropy cutoff, cross-validation score, or coverage criterion) is stated for deciding which candidate hamm-grams survive to the final feature set after LSH and clustering. Without such a rule the claimed robustness advantage cannot be reproduced and may depend on post-hoc choices rather than the LSH construction itself.
  2. [Experimental section] Experimental evaluation: the manuscript provides no description of how robustness is quantified or how controls for dataset shift (temporal splits, family-wise cross-validation, or adversarial variants) are implemented. This omission is load-bearing for the central claim that hamm-grams outperform n-grams in robustness.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'demonstrate the advantages' is vague; the specific tasks, datasets, and metrics should be named.
  2. [Introduction / Preliminaries] Notation: an early concrete example of a hamm-gram (e.g., the byte pattern with wildcard positions) would clarify the definition before the algorithmic description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The two major comments identify important omissions that affect reproducibility. We will revise the manuscript to address both points explicitly.

read point-by-point responses
  1. Referee: [Mining algorithm description] Mining algorithm section: no concrete selection rule (support threshold, entropy cutoff, cross-validation score, or coverage criterion) is stated for deciding which candidate hamm-grams survive to the final feature set after LSH and clustering. Without such a rule the claimed robustness advantage cannot be reproduced and may depend on post-hoc choices rather than the LSH construction itself.

    Authors: We agree that the current manuscript does not state an explicit selection rule. In the revised version we will add a subsection under the mining algorithm that specifies the exact criteria used: a minimum support threshold of 0.01 (fraction of samples) combined with an entropy cutoff of 0.8 on the wildcard positions, chosen via a small held-out validation set. This rule will be applied uniformly after the clustering step. revision: yes

  2. Referee: [Experimental section] Experimental evaluation: the manuscript provides no description of how robustness is quantified or how controls for dataset shift (temporal splits, family-wise cross-validation, or adversarial variants) are implemented. This omission is load-bearing for the central claim that hamm-grams outperform n-grams in robustness.

    Authors: We acknowledge the omission. The revised experimental section will define robustness as the relative drop in F1-score when the test distribution is shifted (temporal split by collection date and family-wise cross-validation). We will report results under both a random split and the temporal split, and will include a brief description of how the temporal split was constructed from the dataset metadata. revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic construction without self-referential derivations or fitted predictions

full rationale

The paper presents an algorithmic procedure for mining hamm-grams via LSH for small Hamming distance followed by clustering to insert wildcards. No equations, parameter fits, predictions of derived quantities, or self-citations appear in the abstract or described method. The central claim is that the resulting features are more robust than n-grams for malware tasks, but this is an empirical assertion about the algorithm's output rather than a derivation that reduces to its own inputs by construction. The absence of any load-bearing self-definition, fitted-input-as-prediction, or uniqueness theorem means the derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The implicit modeling choice that single-wildcard patterns will be both frequent and discriminative is treated as an unstated domain assumption rather than a derived quantity.

pith-pipeline@v0.9.1-grok · 5638 in / 1168 out tokens · 17043 ms · 2026-07-03T19:58:09.264120+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

    H. S. Anderson and P. Roth. 2018. EMBER: An Open Dataset for Training Static PE Malware Machine Learning Models.ArXiv e-prints(April 2018). arXiv:1804.04637 [cs.CR]

  2. [2]

    Dana Angluin. 1982. Inference of Reversible Languages.J. ACM29, 3 (jul 1982), 741–765. doi:10.1145/322326.322334

  3. [3]

    Dana Angluin, Sarah Eisenstat, and Dana Fisman. 2015. Learning Regular Languages via Alternating Automata. InProceedings of the 24th International Conference on Artificial Intelligence (IJCAI’15). AAAI Press, 3308–3314

  4. [4]

    Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. Drebin: Effective and explainable detection of android malware in your pocket.. InNdss, Vol. 14. 23–26

  5. [5]

    Marcus Botacin, Felipe Duarte Domingues, Fabrício Ceschin, Raphael Machnicki, Marco Antonio Zanata Alves, Paulo Lício de Geus, and André Grégio. 2021. AntiViruses under the Microscope: A Hands-On Perspective.Computers & Securityi (oct 2021), 102500. doi:10.1016/j.cose.2021.102500

  6. [6]

    Zdenek Ceska, Ivo Hanak, and Roman Tesar. 2007. Teraman: A Tool for N-gram Extraction from Large Datasets. In2007 IEEE International Conference on Intelligent Computer Communication and Processing. IEEE, 209–216. doi:10.1109/ICCP.2007.4352162

  7. [7]

    F Coste and J Nicolas. 1997. Regular inference as a graph coloring problem.Workshop on Grammatical Inference, Automata Induction, and Language Acquisition (ICML’97)(1997). citeseer.nj.nec.com/coste97regular.html

  8. [8]

    Curtin, Fred Lu, Edward Raff, and Priyanka Ranade

    Ryan R. Curtin, Fred Lu, Edward Raff, and Priyanka Ranade. 2026. Intermediate N-Gramming: Deterministic and Fast N-Grams For Large N and Large Datasets. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. arXiv. doi:10.48550/arXiv.2511.14955 arXiv:2511.14955 [cs]

  9. [9]

    Jake Drew, Tyler Moore, and Michael Hahsler. 2016. Polymorphic Malware Detection Using Sequence Classification Methods. In2016 IEEE Security and Privacy Workshops (SPW). IEEE, 81–87. doi:10.1109/SPW.2016.30

  10. [10]

    Pierre Dupont, Laurent Miclet, and Enrique Vidal. 1994. What Is the Search Space of the Regular Inference?. InProceedings of the Second International Colloquium on Grammatical Inference and Applications (ICGI ’94). Springer-Verlag, Berlin, Heidelberg, 25–37

  11. [11]

    Siddhant Gupta, Fred Lu, Andrew Barlow, Edward Raff, Francis Ferraro, Cynthia Matuszek, Charles Nicholas, and James Holt. 2024. Living off the Analyst: Harvesting Features from Yara Rules for Malware Detection. In2024 IEEE International Conference on Big Data (BigData). 2624–2634. doi:10.1109/BigData62323.2024.10825735 ISSN: 2573-2978

  12. [12]

    2014.Robust Static Analysis of Portable Executable Malware

    Katja Hahn. 2014.Robust Static Analysis of Portable Executable Malware. Master thesis. HTWK Leipzig

  13. [13]

    R. W. Hamming. 1950. Error detecting and error correcting codes.The Bell System Technical Journal29, 2 (1950), 147–160. doi:10.1002/j.1538- 7305.1950.tb00463.x

  14. [14]

    Paul Jaccard. 1912. The distribution of the flora in the alpine zone. 1.New phytologist11, 2 (1912), 37–50

  15. [15]

    Jiyong Jang, David Brumley, and Shobha Venkataraman. 2011. BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis. InProceedings of the 18th ACM conference on Computer and communications security - CCS. ACM Press, New York, New York, USA, 309–320. doi:10.1145/2046707.2046742

  16. [16]

    Robert J Joyce, Dev Amlani, Charles Nicholas, and Edward Raff. 2022. MOTIF: A Large Malware Reference Dataset with Ground Truth Family Labels. InThe AAAI-22 Workshop on Artificial Intelligence for Cyber Security (AICS). doi:10.48550/arXiv.2111.15031 arXiv: 2111.15031v1

  17. [17]

    Joyce, Derek Everett, Maya Fuchs, Edward Raff, and James Holt

    Robert J. Joyce, Derek Everett, Maya Fuchs, Edward Raff, and James Holt. 2025. ClarAVy: A Tool for Scalable and Accurate Malware Family Labeling. InCompanion Proceedings of the ACM on Web Conference 2025 (WWW ’25). Association for Computing Machinery, New York, NY, USA, 277–286. doi:10.1145/3701716.3715212 14 Derek Everett, Edward Raff, and James Holt

  18. [18]

    Robert J Joyce, Edward Raff, and Charles Nicholas. 2021. A Framework for Cluster and Classifier Evaluation in the Absence of Reference Labels. In Proceedings of the 14th ACM Workshop on Artificial Intelligence and Security (AISec ’21). Association for Computing Machinery. doi:10.1145/3474369. 3486867 arXiv: 2109.11126v1

  19. [19]

    Jeffrey O Kephart, Gregory B Sorkin, William C Arnold, David M Chess, Gerald J Tesauro, and Steve R White. 1995. Biologically Inspired Defenses Against Computer Viruses. InProceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1 (IJCAI’95). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 985–996. http://dl.acm....

  20. [20]

    Kolter and Marcus A

    Jeremy Z. Kolter and Marcus A. Maloof. 2004. Learning to detect malicious executables in the wild. InProceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’04. ACM Press, New York, New York, USA, 470–478. doi:10.1145/1014052. 1014105

  21. [21]

    Jesse Kornblum. 2006. Identifying almost identical files using context triggered piecewise hashing.Digital investigation3 (2006), 91–97

  22. [22]

    Mineichi Kudo and Masaru Shimbo. 1988. Efficient regular grammatical inference techniques by the use of partial similarities and their logical relationships.Pattern Recognition21, 4 (1988), 401–409. doi:10.1016/0031-3203(88)90053-2

  23. [23]

    Ralph Langner. 2011. Stuxnet: Dissecting a Cyberwarfare Weapon.IEEE Security & Privacy9, 3 (2011), 49–51. doi:10.1109/MSP.2011.67

  24. [24]

    Kirill Levchenko, Andreas Pitsillidis, Neha Chachra, Brandon Enright, Márk Félegyházi, Chris Grier, Tristan Halvorson, Chris Kanich, Christian Kreibich, He Liu, Damon McCoy, Nicholas Weaver, Vern Paxson, Geoffrey M Voelker, and Stefan Savage. 2011. Click Trajectories: End-to-End Analysis of the Spam Value Chain. InProceedings of the 2011 IEEE Symposium on...

  25. [25]

    Ji Lin, Wei-Ming Chen, John Cohn, Chuang Gan, and Song Han. 2020. MCUNet: Tiny Deep Learning on IoT Devices. InAnnual Conference on Neural Information Processing Systems (NeurIPS)

  26. [26]

    2008.Introduction to information retrieval

    Christopher D Manning. 2008.Introduction to information retrieval. Syngress Publishing,

  27. [27]

    Masud, Latifur Khan, and Bhavani Thuraisingham

    Mohammad M. Masud, Latifur Khan, and Bhavani Thuraisingham. 2008. A scalable multi-level feature extraction technique to detect malicious executables.Information Systems Frontiers10, 1 (mar 2008), 33–45. doi:10.1007/s10796-007-9054-3

  28. [28]

    Tirth Patel, Fred Lu, Edward Raff, Charles Nicholas, Cynthia Matuszek, and James Holt. 2023. Small Effect Sizes in Malware Detection? Make Harder Train/Test Splits!Proceedings of the Conference on Applied Machine Learning in Information Security(2023). https://arxiv.org/abs/2312.15813

  29. [29]

    Curtin, Derek Everett, Robert J

    Edward Raff, Ryan R. Curtin, Derek Everett, Robert J. Joyce, and James Holt. 2025. Zipf-Gramming: Scaling Byte N-Grams Up to Production Sized Malware Corpora. InProceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25). Association for Computing Machinery, New York, NY, USA, 5988–5996. doi:10.1145/3746252.3761551

  30. [30]

    Edward Raff, William Fleming, Richard Zak, Hyrum Anderson, Bill Finlayson, Charles Nicholas, and Mark Mclean. 2019. Kilograms: Very large n-grams for malware classification.arXiv preprint arXiv:1908.00200(2019)

  31. [31]

    Edward Raff and Mark McLean. 2018. Hash-Grams On Many-Cores and Skewed Distributions. In2018 IEEE International Conference on Big Data (Big Data). IEEE, 158–165. doi:10.1109/BigData.2018.8622043

  32. [32]

    Edward Raff and Charles Nicholas. 2018. Hash-Grams: Faster N-Gram Features for Classification and Malware Detection. InProceedings of the ACM Symposium on Document Engineering 2018. ACM, Halifax, NS, Canada. doi:10.1145/3209280.3229085

  33. [33]

    Edward Raff and Charles Nicholas. 2018. Lempel-Ziv Jaccard Distance, an effective alternative to ssdeep and sdhash.Digital Investigation24 (2018), 34–49

  34. [34]

    Rayleigh. 1905. The Problem of the Random Walk.Nature72, 1866 (1905), 318–318. doi:10.1038/072318a0

  35. [35]

    Vassil Roussev. 2010. Data fingerprinting with similarity digests. InAdvances in Digital Forensics VI: Sixth IFIP WG 11.9 International Conference on Digital Forensics, Hong Kong, China, January 4-6, 2010, Revised Selected Papers 6. Springer, 207–226

  36. [36]

    Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. 2009. Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey.Information Security Technical Report14, 1 (2009), 16–29. doi:10.1016/j.istr.2009.03.003

  37. [37]

    Ehsan Shareghi, Daniela Gerz, Ivan Vulić, and Anna Korhonen. 2019. Show Some Love to Your n-grams: A Bit of Progress and Stronger n- gram Language Modeling Baselines. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burst...

  38. [38]

    Charles Smutz and Angelos Stavrou. 2012. Malicious PDF detection using metadata and structural features. InProceedings of the 28th Annual Computer Security Applications Conference. ACM, 239–248

  39. [39]

    Tse and P

    D. Tse and P. Viswanath. 2005.Fundamentals of Wireless Communication. Cambridge University Press. https://books.google.com/books?id= 66XBb5tZX6EC

  40. [40]

    van Baar, H.M.A

    R.B. van Baar, H.M.A. van Beek, and E.J. van Eijk. 2014. Digital Forensics as a Service: A game changer.Digital Investigation11 (2014), S54–S62. doi:10.1016/j.diin.2014.03.007

  41. [41]

    Charlotte Van Camp and Walter Peeters. 2022. A World without Satellite Data as a Result of a Global Cyber-Attack.Space Policy59 (2022), 101458. doi:10.1016/j.spacepol.2021.101458

  42. [42]

    Rabin, Kristopher Micinski, Jeffrey S

    Daniel Votipka, Seth M. Rabin, Kristopher Micinski, Jeffrey S. Foster, and Michelle M. Mazurek. 2019. An Observational Investigation of Reverse Engineers ’ Processes. InUSENIX Security Symposium

  43. [43]

    Andrew Walenstein, Michael Venable, Matthew Hayes, Christopher Thompson, and Arun Lakhotia. 2007. Exploiting similarity between variants to defeat malware. InProc. BlackHat DC Conf. Hamm-Grams: An Algorithm for Mining Regular Expressions of Bytes 15

  44. [44]

    Weisstein

    Eric W. Weisstein. [n. d.]. Random Walk–2-Dimensional. https://mathworld.wolfram.com/RandomWalk2-Dimensional.html

  45. [45]

    Miuyin Yong Wong, Matthew Landen, Manos Antonakakis, Douglas M Blough, Elissa M Redmiles, and Mustaque Ahamad. 2021. An Inside Look into the Practice of Malware Analysis. InProceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS ’21). Association for Computing Machinery, New York, NY, USA, 3053–3069. doi:10.1145/3460120.3484759

  46. [46]

    Richard Zak, Edward Raff, and Charles Nicholas. 2017. What can N-grams learn for malware detection?. In2017 12th International Conference on Malicious and Unwanted Software (MALW ARE). IEEE, 109–118. doi:10.1109/MALWARE.2017.8323963