arXiv: 2606.06117 · v1 · pith:PFD4RTPHnew · submitted 2026-06-04 · 🧬 q-bio.QM · cs.LG · math.AT · q-bio.GN

p-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences

Pith reviewed 2026-06-27 22:44 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.LG · math.AT · q-bio.GN
keywords p-adic numbers · topological data analysis · genomic sequence classification · alignment-free methods · Vietoris-Rips complex · bi-filtration · machine learning · k-mer analysis

The pith

pVR encodes genomic sequences via a bi-filtration of p-adic prefix distances and k-mer frequency distances to extract topological features for classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents pVR as an alignment-free method that represents each DNA sequence by two distances: a p-adic distance on ordered k-mer prefixes that encodes hierarchical positional information and an L1 distance on k-mer counts that encodes compositional content. These distances together define a bi-filtered Vietoris-Rips complex whose persistent homology summaries become input features for ordinary classifiers. Theoretical results prove stability of the summaries under small metric changes and invariance to the choice of prime, while showing that a single p-adic filtration produces trivial topology. On twelve genomic benchmarks containing 28 to 500 sequences, the resulting features raise accuracy over four standard alignment-free baselines on three of six low-sample tasks and exceed zero-shot embeddings from a 500-million-parameter transformer on three such tasks.
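
To make the two distances concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a p-adic-style distance of the form p^(-m), where m is the length of the prefix two sequences share, and an L1 distance between normalized k-mer frequency profiles. The function names, the defaults p = 5 and k = 3, and the exact prefix encoding are illustrative assumptions; the paper's definitions may differ.

```python
from collections import Counter


def common_prefix_len(s: str, t: str) -> int:
    """Length of the longest prefix shared by s and t."""
    m = 0
    for a, b in zip(s, t):
        if a != b:
            break
        m += 1
    return m


def prefix_distance(s: str, t: str, p: int = 5) -> float:
    """Assumed p-adic-style distance: p**(-m), m = shared-prefix length."""
    if s == t:
        return 0.0
    return float(p) ** (-common_prefix_len(s, t))


def kmer_profile(seq: str, k: int = 3) -> dict:
    """Normalized frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def l1_distance(u: dict, v: dict) -> float:
    """L1 distance between two sparse frequency profiles."""
    return sum(abs(u.get(key, 0.0) - v.get(key, 0.0)) for key in set(u) | set(v))


# Toy sequences (hypothetical, not taken from the paper's benchmarks).
s, t = "ACGTAC", "ACGAAC"
print(prefix_distance(s, t))                          # hierarchical axis
print(l1_distance(kmer_profile(s), kmer_profile(t)))  # compositional axis
```

The two functions are deliberately independent: two sequences can share a long prefix yet differ in composition, or the reverse, which is the motivation for treating them as separate filtration axes.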

Core claim

pVR jointly parameterizes a bi-filtered Vietoris-Rips complex by a p-adic metric on k-mer prefixes and an L1 metric on k-mer frequencies; the persistent homology of this bi-filtration yields stable, prime-invariant topological summaries that serve as features improving classification accuracy on low-sample genomic datasets relative to alignment-free baselines.
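
One elementary way to see why invariance to the prime is plausible is sketched below. It assumes a distance of the form p^(-m), with m the shared-prefix length; this is an assumption made for illustration, and the paper's own statement and proof may differ. Because p^(-m) is strictly decreasing in m for every prime, changing p rescales distances without changing their order, so edges enter a filtration in the same sequence. The helper names and toy sequences are hypothetical.

```python
from itertools import combinations


def shared_prefix_len(s: str, t: str) -> int:
    """Number of leading characters on which s and t agree."""
    m = 0
    for a, b in zip(s, t):
        if a != b:
            break
        m += 1
    return m


def edge_order(seqs, p):
    """Index pairs sorted by the assumed distance p**(-m); ties broken by index."""
    def dist(pair):
        i, j = pair
        return float(p) ** (-shared_prefix_len(seqs[i], seqs[j]))
    return sorted(combinations(range(len(seqs)), 2),
                  key=lambda pair: (dist(pair), pair))


seqs = ["ACGTAC", "ACGAAC", "ACTTAC", "GGGTAC"]  # toy sequences, not from the paper
orders = {p: edge_order(seqs, p) for p in (2, 3, 5, 7)}
same_order = all(order == orders[2] for order in orders.values())
print(same_order)
```

Only the ordering of edges is compared here; the numerical thresholds at which they appear do depend on p.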

What carries the argument

A bi-filtered Vietoris-Rips complex jointly parameterized by a p-adic distance on k-mer prefixes and an L1 distance on k-mer frequencies.

If this is right

  • pVR outperforms four alignment-free baselines on three of six low-sample datasets by as much as 21 percentage points.
  • pVR outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by 6.7 to 11.4 percentage points on three low-sample benchmarks.
  • The topological summaries remain stable under metric perturbations and invariant to the choice of prime p.
  • A single p-adic axis produces uninformative homology while the bi-filtration recovers nontrivial features.
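
The single-axis triviality in the last bullet has a short general explanation, sketched here under the assumption that the prefix distance is an ultrametric of the form p^(-m). In any ultrametric, d(x, z) ≤ max(d(x, y), d(y, z)), so the relation "distance at most ε" is transitive. Every connected component of the threshold graph is therefore a complete graph, its Vietoris-Rips complex is a full simplex, and no homology above degree 0 can appear. The code checks this on toy data and contrasts it with an ordinary metric; names and data are hypothetical.

```python
from itertools import combinations


def prefix_distance(s: str, t: str, p: int = 5) -> float:
    """Assumed ultrametric: p**(-m) with m the shared-prefix length, 0 if equal."""
    if s == t:
        return 0.0
    m = 0
    for a, b in zip(s, t):
        if a != b:
            break
        m += 1
    return float(p) ** (-m)


def components_are_cliques(points, dist, eps) -> bool:
    """True if every connected component of the eps-threshold graph is complete."""
    n = len(points)
    adj = [[i != j and dist(points[i], points[j]) <= eps for j in range(n)]
           for i in range(n)]
    seen = set()
    for start in range(n):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:  # depth-first search for the component of `start`
            u = stack.pop()
            for v in range(n):
                if adj[u][v] and v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        if any(not adj[i][j] for i, j in combinations(sorted(comp), 2)):
            return False
    return True


seqs = ["ACGTAC", "ACGAAC", "ACTTAC", "GGGTAC", "GGCTAC"]  # toy data
ultrametric_ok = all(components_are_cliques(seqs, prefix_distance, 5.0 ** -m)
                     for m in range(5))
# Contrast: three points on a line under the ordinary absolute-value distance.
line_ok = components_are_cliques([0.0, 1.0, 2.0], lambda x, y: abs(x - y), 1.0)
print(ultrametric_ok, line_ok)  # True False
```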

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bi-filtration construction could be tested on other ordered sequential data such as protein sequences or parsed text where prefix hierarchies are present.
  • Accuracy may remain limited on problems with high rates of isolated substitutions, suggesting a need for hybrid filtrations that also capture pointwise differences.
  • In regimes where all methods saturate, the topological features might still reduce variance when combined with sequence kernels or learned embeddings.

Load-bearing premise

The p-adic distance on k-mer prefixes captures meaningful hierarchical positional structure in the genomic sequences.

What would settle it

Performance gains disappear or reverse on additional genomic datasets dominated by point-mutation divergence rather than hierarchical prefix structure, such as further SARS-CoV-2 variant collections.

Figures

Figures reproduced from arXiv: 2606.06117 by Gunja Sachdeva, Tirtharaj Dash.

Figure 1: Three snapshots of a Vietoris–Rips filtration on four points.
Figure 2: Illustration of Proposition 8 on the five-point configuration {a, b, c, d, e} at (ϵp, ϵc) = (1, 0.4), where e is compositionally close to every point but p-adically far. (a) dp-only: tetrahedron on {a, b, c, d} with e isolated (β1 = 0). (b) dc-only: the 4-cycle a–c–b–d–a is filled through e (β1 = 0). (c) Bi-filtration: e’s edges are excluded, leaving the 4-cycle unfilled (β1 = 1). The cycle is invisible to…
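
The five-point configuration in the Figure 2 caption can be checked numerically. The sketch below assumes that an edge belongs to the bi-filtered complex exactly when both distances are within their thresholds, and it uses made-up distance values (1 and 5 on the p-adic axis, 0.4 and 0.8 on the compositional axis) chosen only to match the caption's description. The first Betti number is computed over GF(2) from boundary-matrix ranks; all helper names are hypothetical.

```python
from itertools import combinations

POINTS = "abcde"


def make_metric(close_pairs, near, far):
    """Symmetric lookup: `near` for the listed pairs, `far` for all others."""
    close = {frozenset(pair) for pair in close_pairs}
    return lambda x, y: near if frozenset((x, y)) in close else far


# Hypothetical values: a, b, c, d mutually p-adically close, e far from all;
# compositionally, the 4-cycle a-c-b-d-a is close and e is close to everyone.
d_p = make_metric(combinations("abcd", 2), near=1.0, far=5.0)
d_c = make_metric(["ac", "cb", "bd", "da", "ea", "eb", "ec", "ed"],
                  near=0.4, far=0.8)


def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    basis = {}  # leading-bit position -> reduced row
    for row in rows:
        while row:
            lead = row.bit_length() - 1
            if lead not in basis:
                basis[lead] = row
                break
            row ^= basis[lead]
    return len(basis)


def betti1(points, has_edge):
    """First Betti number (over GF(2)) of the clique complex of a graph."""
    edges = [e for e in combinations(points, 2) if has_edge(*e)]
    edge_id = {e: i for i, e in enumerate(edges)}
    vert_id = {v: i for i, v in enumerate(points)}
    triangles = [t for t in combinations(points, 3)
                 if all(pair in edge_id for pair in combinations(t, 2))]
    d1 = [(1 << vert_id[u]) | (1 << vert_id[v]) for u, v in edges]
    d2 = [sum(1 << edge_id[pair] for pair in combinations(t, 2))
          for t in triangles]
    return len(edges) - rank_gf2(d1) - rank_gf2(d2)


eps_p, eps_c = 1.0, 0.4
b1_p = betti1(POINTS, lambda x, y: d_p(x, y) <= eps_p)
b1_c = betti1(POINTS, lambda x, y: d_c(x, y) <= eps_c)
b1_bi = betti1(POINTS, lambda x, y: d_p(x, y) <= eps_p and d_c(x, y) <= eps_c)
print(b1_p, b1_c, b1_bi)  # 0 0 1
```

With these values, each single-axis complex has β1 = 0, while the intersection keeps the 4-cycle and drops every triangle that would fill it, giving β1 = 1.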
Figure 3: The pVR pipeline. Each sequence passes through two branches: a p-adic (hierarchical) branch gives the distance matrix Dp, and a compositional (L1) branch gives Dc. The two matrices parameterise a bi-filtered Vietoris–Rips complex, from which per-sequence degree profiles are extracted; these, with the p-adic histograms and k-mer frequencies, form the feature vector for a standard classifier. The ∩ symbol de…
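
As a rough illustration of the degree-profile step in this pipeline, the sketch below reads a "per-sequence degree profile" as the number of neighbours each sequence has in the intersection graph, recorded over a grid of threshold pairs. That reading, the grid, the function name, and the toy matrices are all assumptions; the caption does not specify how the paper defines or normalizes these profiles.

```python
def degree_profiles(Dp, Dc, eps_p_grid, eps_c_grid):
    """Per-point vertex degrees in the intersection graph over a threshold grid.

    An edge {i, j} is present at (eps_p, eps_c) when Dp[i][j] <= eps_p and
    Dc[i][j] <= eps_c. Row i lists the degree of point i for every grid pair,
    with eps_p varying slowest.
    """
    n = len(Dp)
    return [
        [
            sum(1 for j in range(n)
                if j != i and Dp[i][j] <= ep and Dc[i][j] <= ec)
            for ep in eps_p_grid
            for ec in eps_c_grid
        ]
        for i in range(n)
    ]


# Toy distance matrices for three sequences (hypothetical values).
Dp = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
Dc = [[0, 0.5, 0.1], [0.5, 0, 0.1], [0.1, 0.1, 0]]
profiles = degree_profiles(Dp, Dc, eps_p_grid=[1, 2], eps_c_grid=[0.1, 0.5])
for row in profiles:
    print(row)
```

Each row could then be concatenated with other per-sequence features before being passed to a classifier, as the caption describes.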
Figure 4: Cosine-UMAP projections of pVR features (left) and Nucleotide Transformer v2 zero-shot embeddings (right) on Influenza HA-small (N = 59, four subtypes). pVR produces visibly subtype-separated clusters; NT v2 embeddings show only weak separation. Equivalent figures for the other two benchmarks are included within our code repository. Ablation. We classify with one feature group at a time, using XGBoost thr…
Figure 5: Distance matrices (left) and Betti heatmaps (right) for two low-sample datasets.
Original abstract

We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structure, and a compositional $L_1$ distance on $k$-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris--Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single $p$-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks ($28$ to $500$ sequences, $3$ to $7$ classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to $21$ percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by $6.7$ to $11.4$ percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces pVR, a topological machine learning framework for alignment-free genomic sequence classification. It encodes DNA sequences using a p-adic distance on k-mer prefixes (capturing hierarchical positional structure) and an L1 distance on k-mer frequencies (capturing compositional content), jointly parameterizing a bi-filtered Vietoris-Rips complex whose topological summaries serve as features for standard classifiers. Theoretical guarantees are claimed for stability under metric perturbations, invariance to prime choice, and the necessity of the bi-filtration (single p-adic axis being topologically trivial). On twelve genomic benchmarks (28-500 sequences, 3-7 classes), pVR outperforms four alignment-free baselines on three of six low-sample datasets (gains up to 21 pp), underperforms on a SARS-CoV-2 variant set violating the hierarchical assumption, saturates with larger samples, and beats zero-shot 500M-parameter Nucleotide Transformer embeddings by 6.7-11.4 pp on three low-sample tasks. Code is publicly available.

Significance. If the results hold, the work offers a novel integration of p-adic valuations into TDA for genomics that exploits hierarchical structure in sequences, with demonstrated utility in low-sample regimes where it surpasses both classical alignment-free methods and large frozen embeddings. The public codebase is a clear strength, enabling direct verification of the bi-filtration construction and empirical claims.

minor comments (3)
  1. The abstract and results section should explicitly reference the specific theorems or propositions establishing stability, prime invariance, and single-axis triviality (currently only asserted).
  2. The table or figure reporting the per-benchmark accuracies should include standard deviations or statistical significance tests to support the 'up to 21 percentage points' gains.
  3. The k-mer length k and filtration parameters are listed as free parameters; a brief sensitivity analysis or default selection protocol would strengthen the reproducibility claim.
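
For the second comment, one common way to report such comparisons is sketched below: a mean and standard deviation over cross-validation folds, plus the corrected resampled t statistic of Nadeau and Bengio for paired per-fold differences. Whether the paper already uses this test is not stated in the report, and the fold accuracies in the example are placeholders invented for illustration, not results from the paper.

```python
import math
import statistics


def mean_std(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for paired per-fold score differences.

    The variance is inflated by (1/J + n_test/n_train) to account for the
    overlap between training sets across the J folds or resamples.
    """
    j = len(diffs)
    var = statistics.variance(diffs)
    return statistics.mean(diffs) / math.sqrt((1.0 / j + n_test / n_train) * var)


# Placeholder per-fold accuracies (made up for illustration only).
method_a = [0.80, 0.75, 0.85, 0.70, 0.80]
method_b = [0.70, 0.70, 0.75, 0.65, 0.70]
diffs = [a - b for a, b in zip(method_a, method_b)]

print(mean_std(method_a))
print(corrected_resampled_t(diffs, n_train=48, n_test=12))
```

The statistic is compared against a t distribution with J - 1 degrees of freedom, where J is the number of folds.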

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary of our work and the positive evaluation of its significance, particularly regarding the integration of p-adic valuations with TDA for low-sample genomic classification and the public codebase. The recommendation for minor revision is noted. No major comments were provided in the report, so we have no points requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper defines a new p-adic distance on k-mer prefixes and an L1 distance on frequencies, jointly parameterizing a bi-filtered Vietoris-Rips complex whose topological summaries are used as features for off-the-shelf classifiers. Theoretical claims (stability, prime invariance, single-axis triviality) are stated as results of the construction itself rather than fitted or renamed inputs. No equations reduce a claimed prediction to a parameter fit by construction, and no load-bearing premise rests on self-citation chains. Public code supplies an independent verification path. The central claim therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The approach builds on established mathematical structures from number theory and topology, with the main addition being the bi-filtration parameterized by the two distances.

free parameters (2)
  • k-mer length k
    Parameter for encoding sequences into k-mers, likely selected based on data.
  • filtration parameters
    Parameters for the bi-filtration construction.
axioms (2)
  • standard math · p-adic numbers form a metric space suitable for hierarchical structures
    Invoked for the p-adic distance on k-mer prefixes.
  • standard math · Vietoris-Rips complexes are stable under small perturbations of the metric
    Basis for the stability guarantee.
invented entities (1)
  • p-adic bi-filtration · no independent evidence
    purpose: To jointly capture hierarchical positional and compositional information in sequences via two distances
    Newly introduced construction in the paper to address limitations of single-axis filtrations.
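
The stability axiom listed above rests on a simple inclusion, illustrated here on toy matrices. If two distance matrices differ by at most δ in every entry, then every edge present at scale ε in one is present at scale ε + δ in the other, so the Vietoris-Rips complexes are sandwiched between each other at shifted scales. Because a Vietoris-Rips complex is determined by its edges, checking edge sets is enough. This is a generic sketch of that fact, not the paper's stability theorem; the matrices and helper names are hypothetical.

```python
from itertools import combinations


def edge_set(D, eps):
    """Edges (i, j), i < j, of the Vietoris-Rips complex of D at scale eps."""
    return {(i, j) for i, j in combinations(range(len(D)), 2) if D[i][j] <= eps}


def sup_difference(D1, D2):
    """Largest entrywise absolute difference between two distance matrices."""
    return max(abs(a - b) for r1, r2 in zip(D1, D2) for a, b in zip(r1, r2))


# Toy matrices (hypothetical values); D_pert differs from D by at most 0.25.
D = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.5], [2.0, 1.5, 0.0]]
D_pert = [[0.0, 1.25, 1.75], [1.25, 0.0, 1.5], [1.75, 1.5, 0.0]]
delta = sup_difference(D, D_pert)

# Check VR_eps(D) <= VR_{eps+delta}(D_pert) <= VR_{eps+2*delta}(D) on edges.
sandwiched = all(
    edge_set(D, eps) <= edge_set(D_pert, eps + delta) <= edge_set(D, eps + 2 * delta)
    for eps in (0.5, 1.0, 1.5, 2.0)
)
print(delta, sandwiched)  # 0.25 True
```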

pith-pipeline@v0.9.1-grok · 5825 in / 1470 out tokens · 32183 ms · 2026-06-27T22:44:36.903938+00:00 · methodology

