$p$-adic Bi-Filtrations for Topological Machine Learning on Genomic Sequences

Gunja Sachdeva; Tirtharaj Dash

arxiv: 2606.06117 · v1 · pith:PFD4RTPHnew · submitted 2026-06-04 · 🧬 q-bio.QM · cs.LG · math.AT · q-bio.GN

Pith reviewed 2026-06-27 22:44 UTC · model grok-4.3

keywords: p-adic numbers · topological data analysis · genomic sequence classification · alignment-free methods · Vietoris-Rips complex · bi-filtration · machine learning · k-mer analysis

The pith

pVR encodes genomic sequences via a bi-filtration of p-adic prefix distances and k-mer frequency distances to extract topological features for classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents pVR as an alignment-free method that represents each DNA sequence by two distances: a p-adic distance on ordered k-mer prefixes that encodes hierarchical positional information and an L1 distance on k-mer counts that encodes compositional content. These distances together define a bi-filtered Vietoris-Rips complex whose persistent homology summaries become input features for ordinary classifiers. Theoretical results prove stability of the summaries under small metric changes and invariance to the choice of prime, while showing that a single p-adic filtration produces trivial topology. On twelve genomic benchmarks containing 28 to 500 sequences, the resulting features raise accuracy over four standard alignment-free baselines on three of six low-sample tasks and exceed zero-shot embeddings from a 500-million-parameter transformer on three such tasks.
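The two distances described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the choice of overlapping k-mers, the form `p ** (-m)` for the prefix distance (with `m` the number of leading k-mers two sequences share), and the use of relative rather than raw k-mer frequencies are all hypothetical choices made here. The paper's exact encoding and normalization are not given on this page.

```python
from collections import Counter


def kmers(seq, k):
    """Ordered list of overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def padic_prefix_distance(seq_a, seq_b, k=3, p=5):
    """Assumed prefix distance: p ** (-m), where m counts shared leading k-mers.

    Identical k-mer lists get distance 0. Any distance of this form is an
    ultrametric, because sharing a prefix of length m is transitive.
    """
    ka, kb = kmers(seq_a, k), kmers(seq_b, k)
    if ka == kb:
        return 0.0
    m = 0
    for x, y in zip(ka, kb):
        if x != y:
            break
        m += 1
    return float(p) ** (-m)


def kmer_l1_distance(seq_a, seq_b, k=3):
    """L1 distance between relative k-mer frequency vectors (assumed normalization)."""
    ca, cb = Counter(kmers(seq_a, k)), Counter(kmers(seq_b, k))
    na, nb = sum(ca.values()), sum(cb.values())
    if na == 0 or nb == 0:
        raise ValueError("both sequences must contain at least one k-mer")
    return sum(abs(ca[x] / na - cb[x] / nb) for x in set(ca) | set(cb))


# Toy sequences: a and b share a prefix, a and c share composition but not prefix.
a, b, c = "ACGTACGT", "ACGTTTTT", "TTTTACGT"
print(padic_prefix_distance(a, b), padic_prefix_distance(a, c))
print(kmer_l1_distance(a, b), kmer_l1_distance(a, c))
```

Under this assumed form, changing the prime only rescales the prefix distances monotonically, so the order in which pairs enter a filtration does not depend on `p`.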

Core claim

pVR jointly parameterizes a bi-filtered Vietoris-Rips complex by a p-adic metric on k-mer prefixes and an L1 metric on k-mer frequencies; the persistent homology of this bi-filtration yields stable, prime-invariant topological summaries that serve as features improving classification accuracy on low-sample genomic datasets relative to alignment-free baselines.

What carries the argument

bi-filtered Vietoris-Rips complex jointly parameterized by p-adic distance on k-mer prefixes and L1 distance on k-mer frequencies
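
One standard way to write such a complex, consistent with the Figure 2 caption in which an edge must pass both thresholds, is sketched below. The symbols $X$ (the set of sequences), $d_p$, $d_c$, $\epsilon_p$ and $\epsilon_c$ are notation chosen here; the paper's own definition may differ in detail.

```latex
\[
\mathrm{VR}_{\epsilon_p,\epsilon_c}(X)
  = \bigl\{\, \sigma \subseteq X \;:\;
      d_p(x,y) \le \epsilon_p \ \text{and}\ d_c(x,y) \le \epsilon_c
      \ \ \text{for all } x,y \in \sigma \,\bigr\}
\]
\[
\mathrm{VR}_{\epsilon_p,\epsilon_c}(X) \subseteq \mathrm{VR}_{\epsilon_p',\epsilon_c'}(X)
  \quad \text{whenever } \epsilon_p \le \epsilon_p' \text{ and } \epsilon_c \le \epsilon_c'
\]
```

The second line is the monotonicity that makes the family a bi-filtration: raising either threshold can only add simplices.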

If this is right

pVR outperforms four alignment-free baselines on three of six low-sample datasets by as much as 21 percentage points.
pVR outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by 6.7 to 11.4 percentage points on three low-sample benchmarks.
The topological summaries remain stable under metric perturbations and invariant to the choice of prime p.
A single p-adic axis produces uninformative homology while the bi-filtration recovers nontrivial features.
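
The single-axis claim matches a general fact about ultrametrics. The sketch below assumes that the p-adic prefix distance $d_p$ satisfies the strong triangle inequality; the notation ($\sim_\epsilon$, $\Delta^{C}$ for the full simplex on a vertex set $C$) is introduced here, and the paper's own proof may proceed differently.

```latex
\[
d_p(x,z) \le \max\{ d_p(x,y),\, d_p(y,z) \}
\;\Longrightarrow\;
\bigl( x \sim_\epsilon y \iff d_p(x,y) \le \epsilon \bigr)
\ \text{is an equivalence relation on } X
\]
\[
\mathrm{VR}_\epsilon(X, d_p) = \bigsqcup_{C \,\in\, X/\sim_\epsilon} \Delta^{C}
\;\Longrightarrow\;
\beta_k\bigl(\mathrm{VR}_\epsilon(X, d_p)\bigr) = 0 \quad \text{for all } k \ge 1
\]
```

Each equivalence class spans a full simplex, which is contractible, so only $\beta_0$ (the number of clusters) changes with $\epsilon$. Adding the $d_c$ threshold removes edges inside these simplices, which is how a cycle can appear in the bi-filtration.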

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bi-filtration construction could be tested on other ordered sequential data such as protein sequences or parsed text where prefix hierarchies are present.
Accuracy may remain limited on problems with high rates of isolated substitutions, suggesting a need for hybrid filtrations that also capture pointwise differences.
In regimes where all methods saturate, the topological features might still reduce variance when combined with sequence kernels or learned embeddings.

Load-bearing premise

The p-adic distance on k-mer prefixes captures meaningful hierarchical positional structure in the genomic sequences.

What would settle it

Performance gains disappear or reverse on additional genomic datasets dominated by point-mutation divergence rather than hierarchical prefix structure, such as further SARS-CoV-2 variant collections.

Figures

Figures reproduced from arXiv: 2606.06117 by Gunja Sachdeva, Tirtharaj Dash.

**Figure 2.** Illustration of Proposition 8 on the five-point configuration $\{a, b, c, d, e\}$ at $(\epsilon_p, \epsilon_c) = (1, 0.4)$, where $e$ is compositionally close to every point but $p$-adically far. (a) $d_p$-only: tetrahedron on $\{a, b, c, d\}$ with $e$ isolated ($\beta_1 = 0$). (b) $d_c$-only: the 4-cycle a–c–b–d–a is filled through $e$ ($\beta_1 = 0$). (c) Bi-filtration: $e$'s edges are excluded, leaving the 4-cycle unfilled ($\beta_1 = 1$). The cycle is invisible to…

**Figure 3.** The pVR pipeline. Each sequence passes through two branches: a $p$-adic (hierarchical) branch gives the distance matrix $D_p$, and a compositional ($L_1$) branch gives $D_c$. The two matrices parameterise a bi-filtered Vietoris–Rips complex, from which per-sequence degree profiles are extracted; these, with the $p$-adic histograms and $k$-mer frequencies, form the feature vector for a standard classifier. The ∩ symbol de…

**Figure 4.** Cosine-UMAP projections of pVR features (left) and Nucleotide Transformer v2 zero-shot embeddings (right) on Influenza HA-small ($N = 59$, four subtypes). pVR produces visibly subtype-separated clusters; NT v2 embeddings show only weak separation. Equivalent figures for the other two benchmarks are included within our code repository. Ablation. We classify with one feature group at a time, using XGBoost thr…

**Figure 5.** Distance matrices (left) and Betti heatmaps (right) for two low-sample datasets.
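
The three panels described in the Figure 2 caption can be checked numerically. The sketch below is a hypothetical reconstruction: the caption does not give the pairwise distances, so the values used here (0.5 and 2.0 for the p-adic axis; 0.3, 0.35 and 0.5 for the compositional axis) are invented to satisfy the caption's description, and the helper `rips_betti1` is a name introduced for illustration. It computes the first Betti number over the rationals from the ranks of the boundary matrices of the clique complex.

```python
from itertools import combinations

import numpy as np


def rips_betti1(points, edge_ok):
    """beta_1 of the clique (Vietoris-Rips) complex whose edges satisfy edge_ok."""
    edges = [e for e in combinations(points, 2) if edge_ok(*e)]
    edge_set = set(edges)
    tris = [t for t in combinations(points, 3)
            if all(pair in edge_set for pair in combinations(t, 2))]
    vidx = {v: i for i, v in enumerate(points)}
    eidx = {e: i for i, e in enumerate(edges)}
    # Boundary matrix from edges to vertices.
    d1 = np.zeros((len(points), len(edges)))
    for (u, v), j in eidx.items():
        d1[vidx[u], j], d1[vidx[v], j] = -1, 1
    # Boundary matrix from triangles to edges.
    d2 = np.zeros((len(edges), len(tris)))
    for j, (u, v, w) in enumerate(tris):
        d2[eidx[(v, w)], j] = 1
        d2[eidx[(u, w)], j] = -1
        d2[eidx[(u, v)], j] = 1
    rank1 = int(np.linalg.matrix_rank(d1)) if edges else 0
    rank2 = int(np.linalg.matrix_rank(d2)) if tris else 0
    return len(edges) - rank1 - rank2  # dim ker(d1) - rank(d2)


points = ["a", "b", "c", "d", "e"]
pairs = [frozenset(pr) for pr in combinations(points, 2)]
cycle = {frozenset(pr) for pr in [("a", "c"), ("c", "b"), ("b", "d"), ("d", "a")]}

# Hypothetical distances: e is p-adically far but compositionally close to all.
d_p = {pr: (2.0 if "e" in pr else 0.5) for pr in pairs}
d_c = {pr: (0.35 if "e" in pr else 0.3 if pr in cycle else 0.5) for pr in pairs}

eps_p, eps_c = 1.0, 0.4
betti_p = rips_betti1(points, lambda u, v: d_p[frozenset((u, v))] <= eps_p)
betti_c = rips_betti1(points, lambda u, v: d_c[frozenset((u, v))] <= eps_c)
betti_bi = rips_betti1(
    points,
    lambda u, v: d_p[frozenset((u, v))] <= eps_p and d_c[frozenset((u, v))] <= eps_c,
)
print(betti_p, betti_c, betti_bi)
```

With these invented distances, each single-axis complex has no 1-dimensional hole, while requiring both thresholds leaves the 4-cycle unfilled, matching the pattern the caption describes.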

read the original abstract

We introduce pVR, a topological machine learning framework for alignment-free genomic sequence classification that combines $p$-adic numbers with topological data analysis. Each DNA sequence is encoded along two complementary axes: a $p$-adic distance on $k$-mer prefixes, which captures hierarchical positional structure, and a compositional $L_1$ distance on $k$-mer frequencies, which captures local sequence content. The two distances jointly parameterise a bi-filtered Vietoris--Rips complex, and per-sequence topological summaries from this bi-filtration serve as features for standard machine learning classifiers. We establish theoretical guarantees for the construction: stability under metric perturbations and invariance to the choice of prime, alongside a result that explains why a single $p$-adic axis is topologically uninformative and why the bi-filtration recovers nontrivial homology. On twelve genomic benchmarks ($28$ to $500$ sequences, $3$ to $7$ classes), pVR outperforms four established alignment-free baselines on three of six low-sample datasets, with gains of up to $21$ percentage points; it underperforms only on a SARS-CoV-2 variant benchmark whose point-mutation divergence violates the hierarchical assumption, and all methods saturate in the large-sample regime. pVR also outperforms zero-shot frozen embeddings from the 500M-parameter Nucleotide Transformer v2 by $6.7$ to $11.4$ percentage points on three low-sample benchmarks. The pVR codebase is publicly available at https://github.com/MAHI-Group/pVR.
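
The Figure 3 caption mentions per-sequence degree profiles extracted from the bi-filtered complex. One plausible reading, sketched here as an assumption rather than as the authors' feature definition, is the vertex degree of each sequence in the 1-skeleton at every point of a threshold grid. The function name `degree_profiles` and the grid arguments are hypothetical.

```python
import numpy as np


def degree_profiles(D_p, D_c, eps_p_grid, eps_c_grid):
    """Per-sequence vertex degrees over a grid of (eps_p, eps_c) thresholds.

    D_p, D_c: symmetric (n, n) distance matrices for the two axes.
    Returns an (n, len(eps_p_grid) * len(eps_c_grid)) integer array, with
    columns ordered by eps_p first and eps_c second.
    """
    D_p, D_c = np.asarray(D_p), np.asarray(D_c)
    n = D_p.shape[0]
    off_diag = ~np.eye(n, dtype=bool)  # a sequence is not its own neighbour
    columns = []
    for ep in eps_p_grid:
        for ec in eps_c_grid:
            # An edge exists only if it passes both thresholds.
            adj = (D_p <= ep) & (D_c <= ec) & off_diag
            columns.append(adj.sum(axis=1))
    return np.stack(columns, axis=1)


# Toy example with three sequences; the third is far on the p-adic axis.
D_p = [[0.0, 0.5, 2.0], [0.5, 0.0, 2.0], [2.0, 2.0, 0.0]]
D_c = [[0.0, 0.3, 0.35], [0.3, 0.0, 0.35], [0.35, 0.35, 0.0]]
print(degree_profiles(D_p, D_c, eps_p_grid=[1.0, 3.0], eps_c_grid=[0.4]))
```

The resulting rows could be concatenated with other per-sequence features and passed to any standard classifier.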

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

pVR's bi-filtration pairs p-adic prefix distances with L1 frequencies to get nontrivial topology and gains on small genomic datasets, with code out and the failure case acknowledged.

The new piece is the bi-filtration using p-adic distances on k-mer prefixes together with L1 on frequencies. They prove that a single p-adic axis gives trivial homology and that the pair recovers nontrivial persistent features, plus stability under perturbations and invariance to the prime. That construction is distinct from prior single-metric TDA work on sequences.

On the experiments, they report wins over four alignment-free baselines on three of six low-sample benchmarks, with lifts up to 21 points, and they beat zero-shot embeddings from the 500M-parameter Nucleotide Transformer on three of those sets. The code is public, which is useful for checking. They are direct about the drop on the SARS-CoV-2 variant set, where point mutations violate the hierarchical assumption, and they note that all methods saturate once sample sizes increase.

The soft spots are the modest benchmark sizes overall and the dependence on the hierarchy holding, though both are flagged in the abstract. The theoretical guarantees are asserted cleanly but would need the derivations inspected in review. No circularity or hidden fitting shows up.

This is for readers working on topological methods for short biological sequences or alignment-free classification. The idea is specific, the results are reported with the right caveats, and the open code supplies an independent check. It deserves a serious referee rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The paper introduces pVR, a topological machine learning framework for alignment-free genomic sequence classification. It encodes DNA sequences using a p-adic distance on k-mer prefixes (capturing hierarchical positional structure) and an L1 distance on k-mer frequencies (capturing compositional content), jointly parameterizing a bi-filtered Vietoris-Rips complex whose topological summaries serve as features for standard classifiers. Theoretical guarantees are claimed for stability under metric perturbations, invariance to prime choice, and the necessity of the bi-filtration (single p-adic axis being topologically trivial). On twelve genomic benchmarks (28-500 sequences, 3-7 classes), pVR outperforms four alignment-free baselines on three of six low-sample datasets (gains up to 21 pp), underperforms on a SARS-CoV-2 variant set violating the hierarchical assumption, saturates with larger samples, and beats zero-shot 500M-parameter Nucleotide Transformer embeddings by 6.7-11.4 pp on three low-sample tasks. Code is publicly available.

Significance. If the results hold, the work offers a novel integration of p-adic valuations into TDA for genomics that exploits hierarchical structure in sequences, with demonstrated utility in low-sample regimes where it surpasses both classical alignment-free methods and large frozen embeddings. The public codebase is a clear strength, enabling direct verification of the bi-filtration construction and empirical claims.

minor comments (3)

The abstract and results section should explicitly reference the specific theorems or propositions establishing stability, prime invariance, and single-axis triviality (currently only asserted).
The table or figure reporting the per-benchmark accuracies should include standard deviations or statistical significance tests to support the gains of "up to 21 percentage points".
The k-mer length k and filtration parameters are listed as free parameters; a brief sensitivity analysis or default selection protocol would strengthen the reproducibility claim.
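
For the second comment, one widely used option when accuracies come from repeated train/test splits is the corrected resampled t-test of Nadeau and Bengio, which inflates the variance to account for overlapping training sets. The sketch below is illustrative only: this page does not say which test, if any, the paper applies, and the function name and arguments are hypothetical.

```python
import math
from statistics import mean, variance


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for per-split accuracy differences.

    diffs: accuracy of method A minus accuracy of method B on each split.
    n_train, n_test: number of training and test examples per split.
    Returns (t, degrees_of_freedom); compare t against a Student t table.
    """
    j = len(diffs)
    if j < 2:
        raise ValueError("need at least two splits")
    var = variance(diffs)  # sample variance with denominator j - 1
    if var == 0:
        raise ValueError("zero variance: the statistic is undefined")
    # The n_test / n_train term is the correction; a naive t-test uses 1 / j alone.
    t = mean(diffs) / math.sqrt((1.0 / j + n_test / n_train) * var)
    return t, j - 1


t_stat, dof = corrected_resampled_t([0.1, 0.2, 0.3], n_train=40, n_test=10)
print(t_stat, dof)
```

The corrected statistic is always smaller in magnitude than the uncorrected one, which matters most on benchmarks as small as those reported here.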

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary of our work and the positive evaluation of its significance, particularly regarding the integration of p-adic valuations with TDA for low-sample genomic classification and the public codebase. The recommendation for minor revision is noted. No major comments were provided in the report, so we have no points requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity

The paper defines a new p-adic distance on k-mer prefixes and an L1 distance on frequencies, jointly parameterizing a bi-filtered Vietoris-Rips complex whose topological summaries are used as features for off-the-shelf classifiers. Theoretical claims (stability, prime invariance, single-axis triviality) are stated as results of the construction itself rather than fitted or renamed inputs. No equations reduce a claimed prediction to a parameter fit by construction, and no load-bearing premise rests on self-citation chains. Public code supplies an independent verification path. The central claim therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The approach builds on established mathematical structures from number theory and topology, with the main addition being the bi-filtration parameterized by the two distances.

free parameters (2)

k-mer length k
Parameter for encoding sequences into k-mers, likely selected based on data.
filtration parameters
Parameters for the bi-filtration construction.

axioms (2)

standard math · p-adic numbers form a metric space suitable for hierarchical structures
Invoked for the p-adic distance on k-mer prefixes.
standard math · The persistent homology of Vietoris-Rips complexes is stable under small perturbations of the metric
Basis for the stability guarantee.
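
For a single metric, the standard form of this stability statement can be written as follows. Here $d$ and $d'$ are two metrics on the same finite set $X$, $\mathrm{Dgm}_k$ is the degree-$k$ persistence diagram, and $d_B$ is the bottleneck distance; all of this notation is introduced here. The paper's guarantee for the bi-filtration is not reproduced on this page and may be phrased with an interleaving distance instead.

```latex
\[
\max_{x,y \in X} \bigl| d(x,y) - d'(x,y) \bigr| \le \delta
\;\Longrightarrow\;
\mathrm{VR}_{\epsilon}(X,d) \subseteq \mathrm{VR}_{\epsilon+\delta}(X,d')
  \subseteq \mathrm{VR}_{\epsilon+2\delta}(X,d)
\quad \text{for all } \epsilon
\]
\[
d_B\bigl( \mathrm{Dgm}_k(X,d),\ \mathrm{Dgm}_k(X,d') \bigr) \le \delta
\]
```

The first line says the two filtrations are $\delta$-interleaved; the second is the resulting bound on the diagrams.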

invented entities (1)

p-adic bi-filtration · no independent evidence
purpose: To jointly capture hierarchical positional and compositional information in sequences via two distances
Newly introduced construction in the paper to address limitations of single-axis filtrations.
