Fast Iteration of Spaced k-mers

Lucas Czech

arxiv: 2603.25417 · v3 · submitted 2026-03-26 · 🧬 q-bio.GN · cs.DS

Fast Iteration of Spaced k-mers

Lucas Czech This is my paper

Pith reviewed 2026-05-15 06:55 UTC · model grok-4.3

classification 🧬 q-bio.GN cs.DS

keywords spaced k-mersk-mer iterationbit manipulationsequence extractionbioinformaticshigh-performance algorithmsnucleotide processing

0 comments

The pith

Spaced k-mers can be extracted from nucleotide sequences at nearly the same speed as regular k-mers by using CPU bit manipulation instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a set of algorithms that pull spaced k-mers directly from DNA sequences with simple bit shifts and masks. These routines replace slower masking or loop-based methods and deliver up to 750 megabytes of sequence data per second on one core. The result removes the previous performance penalty that had limited spaced k-mers in large-scale tools. The authors also point out routine inefficiencies in k-mer code that waste cycles on most hardware.

Core claim

Spaced k-mers select only a subset of positions inside each window of length k, which increases tolerance to mismatches. The central advance is a family of bit-manipulation procedures that compute exactly those positions in a single pass over the input bits, without extra branching or table lookups.

What carries the argument

CPU bit manipulation instructions that shift and mask sequence bits to isolate the spaced positions directly.

If this is right

High-throughput pipelines for read mapping and metagenomic classification can adopt spaced k-mers without slowing down.
Error-robust spaced patterns become practical for genome assembly at full sequencing throughput.
Implementation effort drops because the code uses only standard CPU intrinsics instead of custom loops.
Common k-mer processing patterns that cause cache misses or branch mispredictions can be removed for further gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bit tricks could be adapted to extract spaced substrings in other string-processing domains beyond biology.
Wider SIMD instructions on newer chips would likely scale the throughput linearly for very long reads.
Embedding the iterators in existing libraries would let current metagenomic tools switch to spaced k-mers with minimal code changes.
Benchmarks on sequencing data with varying error rates would quantify how much the robustness benefit outweighs any remaining overhead.

Load-bearing premise

The reported speedups from bit operations remain consistent across typical genomic data, common k-mer sizes, and standard processor architectures without hidden overheads in real programs.

What would settle it

Measure the per-core throughput of the new iterator against a standard k-mer iterator on a large real genome file; a sustained gap larger than 20 percent would falsify the no-major-degradation claim.

Figures

Figures reproduced from arXiv: 2603.25417 by Lucas Czech.

**Figure 1.** Figure 1: Spaced k-mer extraction. (a) An input sequence of nucleotide characters. (b) Three consecutive k-mers of length 8 are extracted from the sequence, each overlapping by k − 1 = 7 characters, forming a sliding window of length k over the sequence. (c) A fixed mask m of span k = 8 and weight w = 5 is applied to each k-mer, to select characters where the mask is 1, while leaving out characters where the mask is… view at source ↗

**Figure 2.** Figure 2: Extracting spaced k-mers from a sequence. (a) Single mask. The time per w-mer is shown, i. e., total time divided by number of w-mers in the sequence, averaged across runs with distinct masks, for different hardware and compilers. The naive algorithm loops over the mask for each w-mer; the PEXT intrinsic is the fastest where available as a dedicated hardware instruction; the Butterfly algorithm and its SIM… view at source ↗

read the original abstract

Background: Short sequence substrings of a fixed length k, called k-mers, are a ubiquitous computational primitive in bioinformatics, used across sequence indexing, read mapping, genome assembly, metagenomic classification, and comparative genomics. Spaced k-mers generalize this concept by selecting only a subset of positions within a k-mer, improving robustness to mismatches and sequencing errors. While k-mers are computationally highly efficient, spaced k-mers require additional work to be extracted from a sequence, which has slowed down existing methods. Results: We present a collection of efficient algorithms for extracting spaced k-mers from nucleotide sequences, optimized for different hardware architectures. They are based on bit manipulation instructions at CPU level, making them both simpler to implement and up to an order of magnitude faster than existing methods. We further evaluate common pitfalls in k-mer processing, which can cause substantial inefficiencies. Conclusions: Our approaches allow the utilization of spaced k-mers in high-performance bioinformatics applications without major performance degradation compared to regular k-mers, achieving a throughput of up to 750MB of sequence data per second per core. Availability: The implementation in C++20 is published under the MIT license, and freely available at https://github.com/lczech/fisk

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives concrete bit-manipulation routines that pull spaced k-mers at up to 750 MB/s per core with little slowdown versus regular k-mers.

read the letter

The main takeaway is that these new algorithms use CPU bit instructions to iterate spaced k-mers faster and more simply than earlier methods, with reported throughput that stays competitive on nucleotide data. The work focuses on practical extraction from sequences, which has been a bottleneck for spaced k-mers in indexing and mapping tasks. They supply several architecture-tuned variants and flag common inefficiencies like branching or cache misses that hurt performance in k-mer code. The MIT-licensed C++20 implementation at the linked repo is a plus for anyone who wants to test it directly. The abstract backs the speed claims with specific numbers, and the stress-test found no internal contradictions or untested assumptions that would break the central result on its own terms. That said, the evidence for broad applicability rests on the benchmarks described, and without full details on sequence variety, pattern dependence, or hardware spread, the generalization to every real pipeline stays moderate. This is aimed at bioinformatics developers who already use or want to use spaced k-mers in high-volume settings. The algorithmic focus and released code make it worth a look for anyone building tools in that area. I would send it to peer review because the performance numbers are concrete and the contribution is narrow but usable.

Referee Report

2 major / 3 minor

Summary. The paper presents a collection of bit-manipulation algorithms for extracting spaced k-mers from nucleotide sequences, optimized across CPU architectures. It claims these methods are up to an order of magnitude faster than prior approaches and achieve throughputs of up to 750 MB of sequence data per second per core with no major performance degradation relative to contiguous k-mers, supported by an open-source MIT-licensed C++20 implementation.

Significance. If the reported instruction-level speedups translate to end-to-end throughput on representative data, the work removes a key barrier to adopting spaced k-mers in high-performance pipelines such as read mapping, assembly, and metagenomic classification. The public code repository is a concrete strength that supports immediate verification and reuse.

major comments (2)

[Results] Results section (throughput claims): The central performance assertion of up to 750 MB/s per core and 'no major degradation' versus regular k-mers is presented without error bars, detailed hardware specifications, sequence composition statistics, or per-pattern timing tables in the main text; this leaves the load-bearing claim only moderately supported by the abstract numbers alone.
[Results] Evaluation of common pitfalls: The discussion of inefficiencies in k-mer processing is useful but lacks quantitative ablation showing how the new bit-manipulation routines avoid the identified pitfalls (e.g., branch mispredictions or cache effects) across the tested spaced patterns.

minor comments (3)

[Methods] Methods: Provide explicit pseudocode or C++ snippets for the core bit-manipulation kernels (e.g., the shift-and-mask sequences for a representative spaced pattern) to improve reproducibility without requiring inspection of the repository.
[Figures] Figure captions: Ensure all figures reporting throughput include the exact k, weight, and pattern strings used, as well as the compiler and CPU model.
[Availability] Availability statement: The GitHub link is given, but the manuscript should state the commit hash or release tag corresponding to the benchmarked version.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive summary, the recommendation of minor revision, and the constructive comments on the results presentation. We address each major comment below and will incorporate the suggested improvements.

read point-by-point responses

Referee: [Results] Results section (throughput claims): The central performance assertion of up to 750 MB/s per core and 'no major degradation' versus regular k-mers is presented without error bars, detailed hardware specifications, sequence composition statistics, or per-pattern timing tables in the main text; this leaves the load-bearing claim only moderately supported by the abstract numbers alone.

Authors: We agree that the central throughput claims would be more robustly supported with additional details in the main text. In the revised manuscript we will add the CPU model and clock frequency, error bars computed from repeated measurements, sequence composition statistics for the benchmark data, and a per-pattern timing table directly in the Results section. These values were recorded during the original experiments and can be included without new runs. revision: yes
Referee: [Results] Evaluation of common pitfalls: The discussion of inefficiencies in k-mer processing is useful but lacks quantitative ablation showing how the new bit-manipulation routines avoid the identified pitfalls (e.g., branch mispredictions or cache effects) across the tested spaced patterns.

Authors: We acknowledge that a quantitative ablation would strengthen the evaluation of the pitfalls. We will add hardware-counter measurements (branch mispredictions and cache misses) for both the new routines and the baseline methods across the tested spaced patterns, presented in a new figure or subsection of the revised Results. This will directly quantify how the bit-manipulation approach reduces these inefficiencies. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on algorithmic design and measurements

full rationale

The paper introduces bit-manipulation algorithms for spaced k-mer extraction and reports empirical throughput (up to 750 MB/s per core). No load-bearing step reduces by construction to its own inputs, fitted parameters, or self-citation chains. Claims are grounded in explicit algorithmic descriptions and benchmark results that are externally falsifiable via the provided MIT-licensed code, satisfying the criteria for a self-contained, non-circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work relies on standard bit manipulation operations and existing k-mer concepts from prior literature.

pith-pipeline@v0.9.0 · 5511 in / 981 out tokens · 34194 ms · 2026-05-15T06:55:06.117364+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1]

Marchet C (2024) Advances in colored k-mer sets: essentials for the curious.arXiv [q-bio.GN]

work page 2024
[2]

Marchet C (2024) Advances in practical k-mer sets: essentials for the curious.arXiv [q-bio.GN]

work page 2024
[3]

(2024) A survey of k-mer methods and applications in bioinformatics

Moeckel C, et al. (2024) A survey of k-mer methods and applications in bioinformatics. Computational and structural biotechnology journal23:2289–2303

work page 2024
[4]

(Springer Berlin Heidelberg, Berlin, Heidelberg), pp

Burkhardt S, Kärkkäinen J (2001) Better filtering with gapped q-grams inCombinatorial Pattern Matching, Lecture Notes in Computer Science. (Springer Berlin Heidelberg, Berlin, Heidelberg), pp. 73–85

work page 2001
[5]

Bioinformatics (Oxford, England)18(3):440–445

Ma B, Tromp J, Li M (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics (Oxford, England)18(3):440–445

work page 2002
[6]

Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B (2016) Rasbhari: Optimiz- ing spaced seeds for database searching, read mapping and alignment-free sequence comparison.PLoS computational biology12(10):e1005107

work page 2016
[7]

Kucherov G, Noé L, Roytberg M (2006) A unifying framework for seed sensitivity and its application to subset seeds.Journal of Bioinformatics and Computational Biology 4(2):553–569

work page 2006
[8]

Noé L (2017) Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds.Algorithms for Molecular Biology12(1):1

work page 2017
[11]

Petrucci E, Noé L, Pizzi C, Comin M (2020) Iterative Spaced Seed Hashing: Closing the gap between spaced seed hashing and k-mer hashing.Journal of computational biology: a journal of computational molecular cell biology27(2):223–233

work page 2020
[12]

IEEE/ACM transactions on computational biology and bioinformatics21(6):2330–2339

Mian E, Petrucci E, Pizzi C, Comin M (2024) MISSH: Fast hashing of multiple spaced seeds. IEEE/ACM transactions on computational biology and bioinformatics21(6):2330–2339

work page 2024
[13]

(Springer Nature Switzerland, Cham), pp

Gemin L, Pizzi C, Comin M (2026) DuoHash: Fast hashing of spaced seeds with application to spaced K-mers counting inLecture Notes in Computer Science, Lecture Notes in Computer Science. (Springer Nature Switzerland, Cham), pp. 28–39

work page 2026
[14]

Czech L, Barbera P , Stamatakis A (2020) Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data.Bioinformaticsp. 647958

work page 2020
[17]

(2015) CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.BMC genomics16(1):236

Ounit R, et al. (2015) CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.BMC genomics16(1):236

work page 2015
[18]

Bioinformatics32(22):3492–3494

Mohamadi H, Chu J, Vandervalk BP , Birol I (2016) ntHash: recursive nucleotide hashing. Bioinformatics32(22):3492–3494

work page 2016
[19]

(2022) ntHash2: recursive spaced seed hashing for nucleotide sequences

Kazemi P , et al. (2022) ntHash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics38(20):4812–4813

work page 2022
[20]

re-extract

Häntze H, Horton P (2023) Effects of spaced k-mers on alignment-free genotyping.Bioin- formatics (Oxford, England)39(39 Suppl 1):i213–i221. March 27, 2026|4 Supplementary Text: Fast Iteration of Spaced k-mers Lucas Czech Section for GeoGenetics, Globe Institute, University of Copenhagen, Denmark. Contents 1 Overview 2 1.1 Notation . . . . . . . . . . . . ...

work page 2023
[21]

Mian E, Petrucci E, Pizzi C, Comin M (2024) MISSH: Fast hashing of multiple spaced seeds.IEEE/ACM transactions on computational biology and bioinformatics21(6):2330– 2339

work page 2024
[22]

Corontzos C, Frachtenberg E (2024) Direct-coding DNA with multilevel parallelism.IEEE computer architecture letters23(1):21–24

work page 2024
[23]

Girotto S, Comin M, Pizzi C (2018) FSH: fast spaced seed hashing exploiting adjacent hashes.Algorithms for molecular biology: AMB13(1):8

work page 2018
[24]

Girotto S, Comin M, Pizzi C (2018) Efficient computation of spaced seed hashing with block indexing.BMC bioinformatics19(Suppl 15):441

work page 2018
[25]

Petrucci E, No´ e L, Pizzi C, Comin M (2020) Iterative Spaced Seed Hashing: Closing the gap between spaced seed hashing and k-mer hashing.Journal of computational biology: a journal of computational molecular cell biology27(2):223–233

work page 2020
[26]

(Springer Nature Switzerland, Cham), pp

Gemin L, Pizzi C, Comin M (2026) DuoHash: Fast hashing of spaced seeds with applica- tion to spaced K-mers counting inLecture Notes in Computer Science, Lecture Notes in Computer Science. (Springer Nature Switzerland, Cham), pp. 28–39

work page 2026
[27]

Ounit R, Lonardi S (2016) Higher classification sensitivity of short metagenomic reads with CLARK-S.Bioinformatics (Oxford, England)32(24):3823–3825

work page 2016
[28]

Weinstein CJ (1969) Quantization effects in digital filters, Technical report

work page 1969
[29]

Bioinformatics32(22):3492–3494

Mohamadi H, Chu J, Vandervalk BP, Birol I (2016) ntHash: recursive nucleotide hashing. Bioinformatics32(22):3492–3494

work page 2016
[30]

(2022) ntHash2: recursive spaced seed hashing for nucleotide sequences

Kazemi P, et al. (2022) ntHash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics38(20):4812–4813. 19

work page 2022

[1] [1]

Marchet C (2024) Advances in colored k-mer sets: essentials for the curious.arXiv [q-bio.GN]

work page 2024

[2] [2]

Marchet C (2024) Advances in practical k-mer sets: essentials for the curious.arXiv [q-bio.GN]

work page 2024

[3] [3]

(2024) A survey of k-mer methods and applications in bioinformatics

Moeckel C, et al. (2024) A survey of k-mer methods and applications in bioinformatics. Computational and structural biotechnology journal23:2289–2303

work page 2024

[4] [4]

(Springer Berlin Heidelberg, Berlin, Heidelberg), pp

Burkhardt S, Kärkkäinen J (2001) Better filtering with gapped q-grams inCombinatorial Pattern Matching, Lecture Notes in Computer Science. (Springer Berlin Heidelberg, Berlin, Heidelberg), pp. 73–85

work page 2001

[5] [5]

Bioinformatics (Oxford, England)18(3):440–445

Ma B, Tromp J, Li M (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics (Oxford, England)18(3):440–445

work page 2002

[6] [6]

Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B (2016) Rasbhari: Optimiz- ing spaced seeds for database searching, read mapping and alignment-free sequence comparison.PLoS computational biology12(10):e1005107

work page 2016

[7] [7]

Kucherov G, Noé L, Roytberg M (2006) A unifying framework for seed sensitivity and its application to subset seeds.Journal of Bioinformatics and Computational Biology 4(2):553–569

work page 2006

[8] [8]

Noé L (2017) Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds.Algorithms for Molecular Biology12(1):1

work page 2017

[9] [11]

Petrucci E, Noé L, Pizzi C, Comin M (2020) Iterative Spaced Seed Hashing: Closing the gap between spaced seed hashing and k-mer hashing.Journal of computational biology: a journal of computational molecular cell biology27(2):223–233

work page 2020

[10] [12]

IEEE/ACM transactions on computational biology and bioinformatics21(6):2330–2339

Mian E, Petrucci E, Pizzi C, Comin M (2024) MISSH: Fast hashing of multiple spaced seeds. IEEE/ACM transactions on computational biology and bioinformatics21(6):2330–2339

work page 2024

[11] [13]

(Springer Nature Switzerland, Cham), pp

Gemin L, Pizzi C, Comin M (2026) DuoHash: Fast hashing of spaced seeds with application to spaced K-mers counting inLecture Notes in Computer Science, Lecture Notes in Computer Science. (Springer Nature Switzerland, Cham), pp. 28–39

work page 2026

[12] [14]

Czech L, Barbera P , Stamatakis A (2020) Genesis and Gappa: processing, analyzing and visualizing phylogenetic (placement) data.Bioinformaticsp. 647958

work page 2020

[13] [17]

(2015) CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.BMC genomics16(1):236

Ounit R, et al. (2015) CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.BMC genomics16(1):236

work page 2015

[14] [18]

Bioinformatics32(22):3492–3494

Mohamadi H, Chu J, Vandervalk BP , Birol I (2016) ntHash: recursive nucleotide hashing. Bioinformatics32(22):3492–3494

work page 2016

[15] [19]

(2022) ntHash2: recursive spaced seed hashing for nucleotide sequences

Kazemi P , et al. (2022) ntHash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics38(20):4812–4813

work page 2022

[16] [20]

re-extract

Häntze H, Horton P (2023) Effects of spaced k-mers on alignment-free genotyping.Bioin- formatics (Oxford, England)39(39 Suppl 1):i213–i221. March 27, 2026|4 Supplementary Text: Fast Iteration of Spaced k-mers Lucas Czech Section for GeoGenetics, Globe Institute, University of Copenhagen, Denmark. Contents 1 Overview 2 1.1 Notation . . . . . . . . . . . . ...

work page 2023

[17] [21]

Mian E, Petrucci E, Pizzi C, Comin M (2024) MISSH: Fast hashing of multiple spaced seeds.IEEE/ACM transactions on computational biology and bioinformatics21(6):2330– 2339

work page 2024

[18] [22]

Corontzos C, Frachtenberg E (2024) Direct-coding DNA with multilevel parallelism.IEEE computer architecture letters23(1):21–24

work page 2024

[19] [23]

Girotto S, Comin M, Pizzi C (2018) FSH: fast spaced seed hashing exploiting adjacent hashes.Algorithms for molecular biology: AMB13(1):8

work page 2018

[20] [24]

Girotto S, Comin M, Pizzi C (2018) Efficient computation of spaced seed hashing with block indexing.BMC bioinformatics19(Suppl 15):441

work page 2018

[21] [25]

Petrucci E, No´ e L, Pizzi C, Comin M (2020) Iterative Spaced Seed Hashing: Closing the gap between spaced seed hashing and k-mer hashing.Journal of computational biology: a journal of computational molecular cell biology27(2):223–233

work page 2020

[22] [26]

(Springer Nature Switzerland, Cham), pp

Gemin L, Pizzi C, Comin M (2026) DuoHash: Fast hashing of spaced seeds with applica- tion to spaced K-mers counting inLecture Notes in Computer Science, Lecture Notes in Computer Science. (Springer Nature Switzerland, Cham), pp. 28–39

work page 2026

[23] [27]

Ounit R, Lonardi S (2016) Higher classification sensitivity of short metagenomic reads with CLARK-S.Bioinformatics (Oxford, England)32(24):3823–3825

work page 2016

[24] [28]

Weinstein CJ (1969) Quantization effects in digital filters, Technical report

work page 1969

[25] [29]

Bioinformatics32(22):3492–3494

Mohamadi H, Chu J, Vandervalk BP, Birol I (2016) ntHash: recursive nucleotide hashing. Bioinformatics32(22):3492–3494

work page 2016

[26] [30]

(2022) ntHash2: recursive spaced seed hashing for nucleotide sequences

Kazemi P, et al. (2022) ntHash2: recursive spaced seed hashing for nucleotide sequences. Bioinformatics38(20):4812–4813. 19

work page 2022