arxiv: 2602.10603 · v3 · submitted 2026-02-11 · 💻 cs.LG

dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning

Arnav Shah, Junzhe Li, Parsa Idehpour, Adibvafa Fallahpour, Brandon Wang, Sukjun Hwang, Bo Wang, Patrick D. Hsu, Hani Goodarzi, Albert Gu

Pith reviewed 2026-05-16 02:09 UTC · model grok-4.3

classification 💻 cs.LG

keywords genomic foundation models · DNA sequence learning · dynamic chunking · hierarchical modeling · zero-shot prediction · protein variant fitness · gene essentiality · autoregressive models

The pith

dnaHNet uses differentiable dynamic chunking to compress raw DNA nucleotides into adaptive latent tokens, enabling efficient autoregressive modeling and unsupervised discovery of hierarchical biological structures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Genomic foundation models struggle with a core tradeoff: fixed tokenizers break apart meaningful units like codons while nucleotide-level processing becomes too expensive for long sequences. dnaHNet solves this by training a tokenizer-free autoregressive model that learns to segment sequences end-to-end via a differentiable dynamic chunking process. The mechanism compresses input nucleotides into fewer latent tokens while maintaining predictive accuracy on prokaryotic genomes. This yields quadratic reductions in computation, more than three times faster inference than Transformers, and better zero-shot results than prior models on protein variant fitness and gene essentiality prediction. The same process surfaces hierarchical biological patterns without any labeled supervision.

Core claim

dnaHNet is a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end using a differentiable dynamic chunking mechanism. The mechanism adaptively compresses raw nucleotides into latent tokens to balance compression against accuracy, producing quadratic FLOP savings and over 3x inference speedup relative to Transformers. Pretrained on prokaryotic genomes, the model outperforms leading architectures such as StripedHyena2 on scaling and efficiency while delivering superior zero-shot performance on protein variant fitness and gene essentiality tasks and automatically revealing hierarchical biological structures.

What carries the argument

The differentiable dynamic chunking mechanism that adaptively segments raw nucleotide sequences into latent tokens to reduce computation while preserving predictive accuracy.
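
The summary does not spell out the boundary rule, so the following is only a minimal sketch of one way adaptive chunk boundaries can be placed. It assumes a cosine-similarity router of the kind used in earlier H-Net-style dynamic chunking, where a new chunk starts when adjacent hidden states disagree. The function names, the 0.5 threshold, and the toy 2-D states are illustrative assumptions, not the authors' implementation, and the sketch covers only the hard selection step, not the smoothing that makes training differentiable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def boundary_probs(states):
    """Assumed router: p_t = 0.5 * (1 - cos(h_t, h_{t-1})), with p_0 = 1."""
    probs = [1.0]  # the first position always opens a chunk
    for prev, cur in zip(states, states[1:]):
        probs.append(0.5 * (1.0 - cosine(prev, cur)))
    return probs

def select_chunk_starts(states, threshold=0.5):
    """Indices where a new chunk starts (hard, non-differentiable selection)."""
    return [i for i, p in enumerate(boundary_probs(states)) if p >= threshold]

# Hypothetical per-nucleotide hidden states (2-D for readability).
states = [
    [1.0, 0.0], [1.0, 0.1], [0.9, 0.0],   # similar states: one chunk
    [-1.0, 0.0], [-1.0, 0.1],             # direction flips: new chunk
    [1.0, 0.0],                           # flips again: new chunk
]
starts = select_chunk_starts(states)
print(starts)
```

Under this assumed rule the number of chunks depends on the content of the sequence rather than on a fixed stride, which is the property the compression claims rely on.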

If this is right

Quadratic FLOP reductions allow modeling of substantially longer genomic contexts than fixed-vocabulary or nucleotide-level baselines.
More than 3 times faster inference enables practical deployment on longer sequences.
Superior zero-shot accuracy on protein variant fitness prediction without task-specific fine-tuning.
Superior zero-shot accuracy on gene essentiality prediction without task-specific fine-tuning.
Automatic emergence of hierarchical biological structures from unsupervised pretraining alone.
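
As a rough back-of-the-envelope illustration of the quadratic claim: the query-key pair count in full self-attention grows with the square of sequence length, so compressing the sequence by a factor R shrinks that term by R squared. The sketch below uses the stage ratios quoted from the paper further down this page (R1 = 3, R2 = 2); the 6000-nucleotide input and the helper names are assumptions, and the count ignores the cost of the outer encoder and decoder stages, so it is not the paper's complexity analysis.

```python
def attention_pair_count(seq_len):
    """Query-key pairs in full self-attention: the quadratic term."""
    return seq_len * seq_len

def compressed_length(seq_len, ratios):
    """Length after applying each stage's compression ratio in turn."""
    length = seq_len
    for ratio in ratios:
        length = -(-length // ratio)  # ceiling division
    return length

seq_len = 6000          # hypothetical input length in nucleotides
ratios = [3, 2]         # R1 and R2 as quoted from the paper

inner_len = compressed_length(seq_len, ratios)
reduction = attention_pair_count(seq_len) / attention_pair_count(inner_len)
print(inner_len, reduction)  # 1000 36.0
```

A 6x shorter innermost sequence gives a 36x smaller quadratic term at that stage, which is larger than the reported end-to-end speedup of more than 3x; the gap is expected, since the outer stages still process the full-length input.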

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hierarchical representations could be inspected to test whether discovered chunks align with known regulatory elements across different species.
Efficiency gains may allow pretraining on much larger eukaryotic datasets that current fixed-vocabulary models cannot handle.
The same chunking approach might transfer to other long sequential biological data such as protein sequences or RNA structures.
Discovered latent tokens could serve as a new vocabulary for downstream generative design of synthetic DNA.

Load-bearing premise

The dynamic chunking process successfully preserves biologically meaningful motifs such as codons and regulatory elements while achieving compression.

What would settle it

On a held-out set of zero-shot protein variant fitness or gene essentiality predictions, dnaHNet accuracy falling below that of StripedHyena2 or comparable baselines would falsify the performance advantage.
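
Whether one model "falls below" another on a held-out set is easier to judge with an interval than with a single number. The sketch below is one hedged way to do that: a paired bootstrap over variants of the difference in Spearman correlation between two models' scores and measured fitness. The choice of Spearman correlation as the metric, all function names, and the toy data are assumptions for illustration; the summary does not state which metric or test the paper uses.

```python
import random

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # degenerate resample with no variation
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def paired_bootstrap_diff(fitness, scores_a, scores_b, n_boot=500, seed=0):
    """95% percentile interval for spearman(A) - spearman(B), resampling variants."""
    rng = random.Random(seed)
    n = len(fitness)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        f = [fitness[i] for i in idx]
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(spearman(f, a) - spearman(f, b))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Toy data (hypothetical): model A ranks variants perfectly, model B inverts them.
fitness = [0.1, 0.4, 0.2, 0.9, 0.7, 0.3, 0.8, 0.5]
scores_a = [2.0 * f for f in fitness]
scores_b = [-f for f in fitness]
low, high = paired_bootstrap_diff(fitness, scores_a, scores_b)
print(low, high)
```

An interval for the difference that excludes zero would support a real gap between the two models; an interval that straddles zero would leave the comparison open.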

Original abstract

Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling $>3 \times$ inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

dnaHNet's differentiable dynamic chunking gives a workable path to longer-context genomic autoregressive models with real efficiency gains, but the unsupervised discovery of biological hierarchies rests on thin evidence.

The core advance here is a tokenizer-free autoregressive setup that learns to chunk raw nucleotides differentiably during training. This sidesteps the motif fragmentation of fixed vocabularies and the length limits of pure base-level models. Pretrained on prokaryotic genomes, it reports better scaling than StripedHyena2, quadratic FLOP reductions, and more than 3x inference speedup over standard Transformers. Those efficiency numbers look like the strongest part of the work and address a practical bottleneck in the field. The zero-shot gains on variant fitness and gene essentiality are also worth noting if they hold in the full experiments. What the paper does cleanly is show that end-to-end chunking can be trained without collapsing into trivial compression.

The soft spot is the claim that the model automatically discovers hierarchical biological structures without supervision. The abstract and stress-test note suggest this rests mainly on qualitative inspection rather than controlled metrics such as motif enrichment at chunk boundaries, precision-recall against annotated codons or regulatory elements, or ablations against length-matched random chunkers. Without those, it is difficult to separate genuine biological coherence from generic compression benefits. The lack of reported error bars, ablation tables, or derivation details in the summary also leaves the performance claims hard to evaluate precisely.

This paper is for groups already working on genomic foundation models and long-context sequence architectures. A reader focused on efficiency tricks or new inductive biases for DNA would find the chunking mechanism useful to examine. It deserves a serious referee because the technical approach targets a real limitation and the efficiency results appear grounded, even though the interpretability story needs tighter validation before the claims can be taken at face value.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces dnaHNet, a tokenizer-free autoregressive foundation model for genomic sequences that employs a differentiable dynamic chunking mechanism to adaptively segment raw nucleotide sequences into latent tokens. Pretrained on prokaryotic genomes, it claims to outperform architectures such as StripedHyena2 in scaling and efficiency (with >3× inference speedup via quadratic FLOP reductions), while achieving superior zero-shot performance on protein variant fitness and gene essentiality prediction and automatically discovering hierarchical biological structures without supervision.

Significance. If the empirical claims are substantiated with quantitative validation of the chunking mechanism and proper ablations, the work could advance genomic foundation models by resolving the tradeoff between preserving biologically meaningful motifs and achieving scalable long-context modeling, offering a more interpretable alternative to fixed-vocabulary tokenizers.

major comments (3)

[Abstract] Abstract: the claim that dnaHNet is 'automatically discovering hierarchical biological structures without supervision' and achieves superior zero-shot performance rests on the differentiable dynamic chunking producing biologically coherent segments, yet no boundary enrichment statistics, precision-recall against motif annotations, or ablation against length-matched random chunkers are referenced to support attribution of gains to meaningful representations rather than generic compression.
[Abstract] Abstract and Methods: the assertion of 'quadratic FLOP reductions' and '>3× inference speedup' from recursive chunking lacks a formal derivation, complexity analysis, or comparison table showing wall-clock times and memory usage versus baselines such as StripedHyena2 under matched sequence lengths.
[Experiments] Experiments section: zero-shot results on protein variant fitness and gene essentiality are stated without error bars, statistical significance tests, or ablation studies isolating the contribution of the learned chunking versus fixed-length or random segmentation baselines.
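
To make the first major comment concrete, here is a minimal sketch of a boundary enrichment check against a length-matched random chunker. It measures the fraction of chunk boundaries that fall on codon-frame positions and compares it with a null distribution from random boundary sets of the same size over the same sequence length. The function names, the single known reading frame, and the toy boundaries are assumptions for illustration; real genomes would need per-gene frames and annotated coordinates.

```python
import random

def codon_aligned_fraction(boundaries, frame_offset=0):
    """Fraction of boundaries that land on the first base of a codon."""
    if not boundaries:
        return 0.0
    hits = sum(1 for b in boundaries if (b - frame_offset) % 3 == 0)
    return hits / len(boundaries)

def random_chunker(seq_len, n_boundaries, rng):
    """Length-matched null: same number of boundaries, placed uniformly."""
    return sorted(rng.sample(range(seq_len), n_boundaries))

def enrichment_test(boundaries, seq_len, n_perm=1000, seed=0):
    """Observed fraction, null mean, and one-sided empirical p-value."""
    rng = random.Random(seed)
    observed = codon_aligned_fraction(boundaries)
    null = [
        codon_aligned_fraction(random_chunker(seq_len, len(boundaries), rng))
        for _ in range(n_perm)
    ]
    null_mean = sum(null) / n_perm
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + sum(1 for v in null if v >= observed)) / (n_perm + 1)
    return observed, null_mean, p_value

# Toy boundaries (hypothetical): a chunker that cuts exactly at every codon start.
toy_boundaries = list(range(0, 300, 3))
observed, null_mean, p_value = enrichment_test(toy_boundaries, seq_len=300)
print(observed, round(null_mean, 3), p_value)
```

Because the null uses the same boundary count on the same sequence, it has the same mean chunk length as the learned chunker, so any gap between the observed and null fractions cannot be attributed to compression rate alone.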

minor comments (1)

[Abstract] Abstract: the phrase 'state-of-the-art tokenizer-free autoregressive model' should specify the exact set of baselines and metrics used for this designation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We have revised the manuscript to address all major points raised, adding the necessary quantitative support, formal analyses, and ablations as detailed in our point-by-point responses below.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that dnaHNet is 'automatically discovering hierarchical biological structures without supervision' and achieves superior zero-shot performance rests on the differentiable dynamic chunking producing biologically coherent segments, yet no boundary enrichment statistics, precision-recall against motif annotations, or ablation against length-matched random chunkers are referenced to support attribution of gains to meaningful representations rather than generic compression.

Authors: We agree that additional quantitative validation strengthens the attribution of performance gains to the learned chunking mechanism. In the revised manuscript, we have added boundary enrichment statistics comparing chunk boundaries to known motif annotations, precision-recall metrics, and an ablation study against length-matched random chunkers. These additions confirm that the dynamic chunking discovers biologically coherent segments beyond generic compression.

Revision: yes

Referee: [Abstract] Abstract and Methods: the assertion of 'quadratic FLOP reductions' and '>3× inference speedup' from recursive chunking lacks a formal derivation, complexity analysis, or comparison table showing wall-clock times and memory usage versus baselines such as StripedHyena2 under matched sequence lengths.

Authors: We have incorporated a formal derivation of the complexity in the Methods section, showing how the recursive chunking leads to quadratic FLOP reductions. We also added a comparison table detailing wall-clock inference times and memory usage for dnaHNet versus StripedHyena2 across various sequence lengths, confirming the >3× speedup.

Revision: yes

Referee: [Experiments] Experiments section: zero-shot results on protein variant fitness and gene essentiality are stated without error bars, statistical significance tests, or ablation studies isolating the contribution of the learned chunking versus fixed-length or random segmentation baselines.

Authors: We acknowledge the need for rigorous statistical reporting. The revised Experiments section now includes error bars from multiple runs, p-values from statistical significance tests, and ablation studies comparing the learned chunking to fixed-length and random segmentation baselines, isolating its contribution to the zero-shot performance.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper introduces dnaHNet via a differentiable dynamic chunking mechanism and reports zero-shot performance on variant fitness and essentiality tasks. No equations, derivations, or first-principles results are described that reduce outputs to inputs by construction. Claims of hierarchical structure discovery are presented as outcomes of end-to-end training and evaluated against external tasks and baselines (e.g., StripedHyena2), with no self-citation load-bearing steps, fitted-input renamings, or ansatz smuggling. The derivation chain is self-contained and does not collapse to tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; the central claim rests on the unverified assumption that the chunking mechanism preserves biological coherence.

pith-pipeline@v0.9.0 · 5512 in / 1151 out tokens · 80092 ms · 2026-05-16T02:09:57.338303+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively... recursive chunking yields quadratic FLOP reductions... automatically discovering hierarchical biological structures without supervision.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Target Compression Ratios... R1 = 3 for the first stage to align with the triplet codon structure... R2 = 2... effective compression ratio of R1 × R2 = 6

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.