The Mechanistic Invariance Test: Genomic Language Models Fail to Learn Positional Regulatory Logic
Pith reviewed 2026-05-10 18:39 UTC · model grok-4.3
The pith
Genomic language models fail to learn positional regulatory logic and instead exploit AT content correlations in DNA sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Mechanistic Invariance Test and its follow-up probes (AT titration, positional ablation, spacing changes, and strand flips) indicate that the performance of every tested genomic language model is driven by correlation with AT content (r=0.78-0.96) rather than by any understanding of positional grammar. Evo2-1B and Caduceus invert biological reality by scoring elements at incorrect positions higher than at correct ones, all models are strand-blind, and compositional effects dominate positional ones by a factor of 46. Larger models amplify the bias, while a simple 100-parameter position-aware PWM reaches perfect or near-perfect scores on the benchmark (CSS=1.00, SCR=0.98).
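To make the AT-content correlation concrete, the following minimal sketch computes the AT fraction of each sequence and its Pearson correlation with a set of model scores. The sequences and scores are invented toy values, and `at_content` and `pearson_r` are hypothetical helper names; none of this is taken from the paper's code.

```python
from math import sqrt

def at_content(seq: str) -> float:
    """Fraction of bases in seq that are A or T."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)

def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Toy data (assumption): four short sequences with made-up log-likelihoods.
seqs = ["ATATATAT", "ATATGCGC", "GCGCGCGC", "AATTAGCA"]
scores = [-9.0, -11.0, -13.0, -10.2]

r = pearson_r([at_content(s) for s in seqs], scores)
print(f"r = {r:.3f}")
```

A correlation near 1 on real benchmark scores would be consistent with, though not proof of, a composition-driven model; the probes listed above are what separate the two explanations.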
What carries the argument
The Mechanistic Invariance Test (MIT), a 650-sequence benchmark across 8 classes with scrambled controls that separates compositional sensitivity from positional regulatory understanding.
Load-bearing premise
That the Mechanistic Invariance Test with its scrambled controls cleanly isolates positional regulatory logic from compositional effects without introducing other uncontrolled biases in sequence generation or scoring.
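The premise can be probed with a small check of what a scramble preserves. The sketch below assumes a plain mononucleotide shuffle, which may differ from the paper's scrambling procedure; the sequence is an illustrative construct built from the canonical -35 (TTGACA) and -10 (TATAAT) consensus hexamers with an arbitrary 17 bp spacer, and the function names are hypothetical.

```python
import random
from collections import Counter

def scramble(seq: str, seed: int = 0) -> str:
    """Shuffle the bases of seq; single-base composition is preserved exactly."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def dinucleotide_counts(seq: str) -> Counter:
    """Counts of overlapping 2-mers in seq."""
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))

# Illustrative promoter-like sequence (assumption, not a benchmark sequence).
original = "TTGACA" + "ACGTACGTACGTACGTA" + "TATAAT"
shuffled = scramble(original)

same_composition = Counter(original) == Counter(shuffled)
same_dinucleotides = dinucleotide_counts(original) == dinucleotide_counts(shuffled)
print("mononucleotide composition preserved:", same_composition)
print("dinucleotide counts preserved:", same_dinucleotides)
```

A plain shuffle guarantees only the first property. If the models are sensitive to dinucleotide or higher-order statistics, a scramble of this kind would leave a residual non-positional difference between originals and controls.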
What would settle it
A direct measurement showing that genomic language models assign higher regulatory scores to sequences with correct element positions than to matched sequences with incorrect positions when AT content is held constant across all positions.
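One way to set up such a measurement is to insert the same motif into the same background at two different offsets, so that both sequences contain exactly the same bases. The sketch below illustrates this with two toy scorers; the background, the offsets, and all function names are assumptions made for illustration, not the paper's protocol.

```python
def place_motif(background: str, motif: str, pos: int) -> str:
    """Insert motif into background at pos; base composition is independent of pos."""
    return background[:pos] + motif + background[pos:]

def at_fraction(seq: str) -> float:
    """Fraction of bases in seq that are A or T."""
    return sum(base in "AT" for base in seq) / len(seq)

def composition_only_score(seq: str) -> float:
    """Toy stand-in for a model that reads only AT content."""
    return at_fraction(seq)

def position_aware_score(seq: str, motif: str, expected_pos: int) -> float:
    """Toy stand-in for a model that checks the motif at its expected position."""
    return 1.0 if seq[expected_pos:expected_pos + len(motif)] == motif else 0.0

background = "GC" * 20            # arbitrary GC-only background (assumption)
motif = "TATAAT"                  # canonical -10 consensus hexamer
correct = place_motif(background, motif, 12)   # hypothetical "correct" offset
shifted = place_motif(background, motif, 30)   # hypothetical "incorrect" offset

comp_gap = composition_only_score(correct) - composition_only_score(shifted)
pos_gap = position_aware_score(correct, motif, 12) - position_aware_score(shifted, motif, 12)
print(comp_gap, pos_gap)
```

Because the pair is matched on composition, any score difference a real model shows between `correct` and `shifted` cannot be attributed to AT content alone.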
Original abstract
Genomic language models (gLMs) have transformed computational biology, achieving state-of-the-art performance across genomic tasks. Yet a fundamental question threatens the foundation of this success: do these models learn the mechanistic principles governing gene regulation, or do they merely exploit statistical shortcuts? We introduce the Mechanistic Invariance Test (MIT), a rigorous 650-sequence benchmark across 8 classes with scrambled controls that enables clean discrimination between compositional sensitivity and genuine positional understanding. We evaluate five gLMs spanning all major architectural paradigms (autoregressive, masked, and bidirectional state-space models) and uncover a universal failure mode. Through systematic mechanistic probing via AT titration, positional ablation, spacing perturbation, and strand orientation tests, we demonstrate that apparent compensation sensitivity is driven entirely by AT content correlation (r=0.78-0.96 across architectures), not positional regulatory logic. The failures are striking: Evo2-1B and Caduceus score regulatory elements at incorrect positions higher than correct positions, inverting biological reality. All models are strand-blind. Compositional effects dominate positional effects by 46-fold. Perhaps most revealing, a simple 100-parameter position-aware PWM achieves perfect performance (CSS=1.00, SCR=0.98), exposing that billion-parameter gLMs fail not from insufficient capacity but from fundamentally misaligned inductive biases. Larger models show stronger compositional bias, demonstrating that scale amplifies rather than corrects this limitation. These findings reveal that current gLMs capture surface statistics while missing the positional grammar essential for gene regulation, demanding architectural innovation before deployment in synthetic biology, gene therapy, and clinical variant interpretation.
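The strand-blindness reported in the abstract is what a purely composition-driven scorer would produce, because reverse-complementing a sequence swaps A with T and G with C and therefore leaves its AT content unchanged. The short sketch below demonstrates this invariance; the example sequence and function names are illustrative assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase or lowercase DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def at_content(seq: str) -> float:
    """Fraction of bases in seq that are A or T."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)

seq = "TTGACAGGCTATAATCC"          # arbitrary example sequence (assumption)
rc = reverse_complement(seq)

print(seq, at_content(seq))
print(rc, at_content(rc))
```

A strand-orientation test is therefore informative only for models whose scores depend on something beyond composition, which is the distinction the benchmark is designed to expose.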
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Mechanistic Invariance Test (MIT), a 650-sequence benchmark across 8 classes with scrambled controls, to distinguish compositional sensitivity from positional regulatory logic in genomic language models. Evaluating five gLMs (autoregressive, masked, and state-space architectures) via AT titration, positional ablation, spacing perturbation, and strand orientation tests, it reports that all apparent positional effects are explained by AT-content correlation (r=0.78-0.96), with Evo2-1B and Caduceus inverting correct/incorrect position scores, all models being strand-blind, and compositional effects dominating positional ones by 46-fold. A 100-parameter position-aware PWM baseline achieves near-perfect scores (CSS=1.00, SCR=0.98), while larger models exhibit stronger compositional bias.
Significance. If the MIT controls are shown to isolate positional logic without residual compositional confounds, the work provides a valuable empirical demonstration that current gLMs capture surface statistics rather than the positional grammar of gene regulation. The systematic mechanistic probes and direct comparison to a lightweight PWM baseline are strengths, highlighting that the limitation is inductive bias rather than capacity. The finding that scale amplifies the bias has implications for deploying gLMs in synthetic biology and variant interpretation, and the benchmark itself offers a reusable test for future architectural improvements.
major comments (2)
- [MIT benchmark description] Section describing the MIT benchmark and scrambling procedure: the central claim that scrambled controls enable 'clean discrimination' between compositional and positional effects depends on explicit verification that scrambled sequences match originals on all statistics the models exploit (dinucleotide frequencies, motif co-occurrence, local GC gradients). Without reported Kolmogorov-Smirnov tests or matching statistics on these features, the AT-titration and ablation results could be driven by residual non-positional cues rather than pure composition.
- [Results on model inversions and dominance] Results section on positional ablation and inversion findings: the reported 46-fold dominance of compositional over positional effects and the inversion of correct vs. incorrect positions in Evo2-1B and Caduceus require the exact definition of the dominance ratio (e.g., which effect-size metric from the spacing-perturbation test) and statistical significance testing with correction for the 650-sequence multiple comparisons to support the 'universal failure mode' conclusion.
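The matching check requested in the first major comment could be sketched as follows: compute a per-sequence feature for originals and scrambled controls, then take the two-sample Kolmogorov-Smirnov statistic between the two sets of values. This is a minimal pure-Python illustration with invented toy sequences and hypothetical function names; it returns only the statistic, so a library routine such as SciPy's `ks_2samp` would be needed for a p-value.

```python
from bisect import bisect_right

def dinucleotide_frequency(seq: str, dinuc: str) -> float:
    """Fraction of overlapping 2-mers in seq that equal dinuc."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return pairs.count(dinuc) / len(pairs)

def ks_statistic(sample_a, sample_b) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Toy data (assumption): originals and hand-made permutations of each.
originals = ["TATATAGC", "TTATAAGC", "ATATTAGC"]
scrambled = ["TTAAGCAT", "ATGCTTAA", "AGCTTTAA"]

d = ks_statistic(
    [dinucleotide_frequency(s, "TA") for s in originals],
    [dinucleotide_frequency(s, "TA") for s in scrambled],
)
print(f"KS statistic for TA frequency: {d:.2f}")
```

The same pattern applies to any other per-sequence feature the referee lists, such as a local GC gradient, by swapping out the feature function.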
minor comments (2)
- [Abstract and PWM baseline] Abstract and methods: the 100-parameter PWM is described as 'position-aware' but its exact construction (e.g., how positions are encoded relative to the 8 classes) should be detailed with pseudocode or a small table to allow direct replication.
- [Figures] Figure captions for the AT-titration and strand-orientation plots: axis labels and error bars should explicitly state whether they represent mean ± SEM across the 650 sequences or per-class aggregates.
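As an illustration of what "position-aware" could mean for the PWM baseline, the sketch below builds a log-odds position weight matrix from aligned example sites and scores a sequence only at one expected offset. It is an assumption-laden toy, not the paper's PA-PWM: the aligned sites, the uniform background, the pseudocount, and the offset are all invented, and this matrix has 24 entries rather than the reported 100 parameters.

```python
from math import log

def build_pwm(sites, pseudocount: float = 1.0):
    """Per-position log-odds (vs. a uniform 0.25 background) from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            base: log(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm

def score_at(seq: str, pwm, pos: int) -> float:
    """Sum of log-odds for the window of seq starting at the expected position."""
    window = seq[pos:pos + len(pwm)]
    return sum(column[base] for column, base in zip(pwm, window))

# Invented aligned sites resembling the -10 hexamer (assumption).
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = build_pwm(sites)

expected_pos = 10                                   # hypothetical expected offset
seq_correct = "GC" * 5 + "TATAAT" + "GC" * 5        # motif at the expected offset
seq_shifted = "GC" * 8 + "TATAAT" + "GC" * 2        # same bases, motif moved

print(score_at(seq_correct, pwm, expected_pos))
print(score_at(seq_shifted, pwm, expected_pos))
```

Because the two test sequences contain the same bases, the score gap here comes entirely from position, which is the property the benchmark rewards.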
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our manuscript. Below, we provide point-by-point responses to the major comments and outline the revisions we will implement.
- Referee: Section describing the MIT benchmark and scrambling procedure: the central claim that scrambled controls enable 'clean discrimination' between compositional and positional effects depends on explicit verification that scrambled sequences match originals on all statistics the models exploit (dinucleotide frequencies, motif co-occurrence, local GC gradients). Without reported Kolmogorov-Smirnov tests or matching statistics on these features, the AT-titration and ablation results could be driven by residual non-positional cues rather than pure composition.
Authors: We agree that explicit verification is necessary to fully substantiate the claim of clean discrimination. In the revised manuscript, we will add a supplementary section reporting Kolmogorov-Smirnov tests and matching statistics for dinucleotide frequencies, motif co-occurrence, and local GC gradients between original and scrambled sequences. These will confirm no significant residual differences, thereby strengthening the isolation of positional effects. revision: yes
- Referee: Results section on positional ablation and inversion findings: the reported 46-fold dominance of compositional over positional effects and the inversion of correct vs. incorrect positions in Evo2-1B and Caduceus require the exact definition of the dominance ratio (e.g., which effect-size metric from the spacing-perturbation test) and statistical significance testing with correction for the 650-sequence multiple comparisons to support the 'universal failure mode' conclusion.
Authors: We will revise the results section to explicitly define the dominance ratio as the ratio of the compositional effect size (measured via the spacing-perturbation test) to the positional effect size (from positional ablation). We will also add statistical significance testing using Wilcoxon signed-rank tests with Bonferroni correction for multiple comparisons across the 650 sequences. These changes will provide rigorous quantitative support for the reported inversions and the 46-fold dominance. revision: yes
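A dominance ratio and a Bonferroni decision rule of the kind promised in the rebuttal can be sketched in a few lines. The function names are hypothetical and the p-values are invented. The only real numbers used are the 21.0 LL AT-content span and the 1.68 LL spacing range quoted in the appendix excerpts on this page, which give a 12.5-fold ratio; the inputs behind the headline 46-fold figure are not stated here, so that value is not reproduced.

```python
def dominance_ratio(compositional_effect: float, positional_effect: float) -> float:
    """Ratio of absolute effect sizes; both must be in the same units."""
    return abs(compositional_effect) / abs(positional_effect)

def bonferroni_reject(p_values, alpha: float = 0.05):
    """Reject each null hypothesis whose p-value is at most alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

# Values quoted in the appendix excerpts: AT-content span vs. spacing range (LL units).
ratio = dominance_ratio(21.0, 1.68)
print(f"dominance ratio: {ratio:.1f}")

# Invented p-values (assumption) for three hypothetical comparisons.
decisions = bonferroni_reject([0.0001, 0.02, 0.04])
print(decisions)
```

With 650 comparisons the Bonferroni threshold at alpha = 0.05 falls to roughly 7.7e-5, so the choice of which comparisons count as a family matters a great deal for the reported significance.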
Circularity Check
No significant circularity: empirical benchmark with external controls
The paper is an empirical benchmarking study that introduces the Mechanistic Invariance Test and applies it to evaluate gLMs on sequence perturbations. All reported results (AT correlations, positional scoring inversions, 46-fold dominance, PWM baseline performance) are direct experimental measurements on held-out or generated sequences rather than quantities derived from fitted parameters inside the same equations. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the central claims; the PWM baseline is presented as an independent comparator, not as a fitted input renamed as a prediction. The derivation chain is therefore self-contained against external sequence data and model outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Scrambled controls preserve compositional statistics while fully removing positional information.
Reference graph
Works this paper leans on
- [1] Positions 25 and 45 show anomalously large penalties (−2.53 and −2.17) because the UP element overlaps with the -35 box (positions 30–35) and -10 box (positions 53–58), disrupting their consensus sequences
- [2] Excluding these confounded positions, the positional effect ranges from −0.49 to +0.49, a total span of only 0.98 LL units
- [3] The compositional effect (None vs. 15: −3.70) is ∼8× larger than the maximum positional effect (0.46).
B.4 Full Spacing Sensitivity Results
Table 11: Complete spacing sensitivity experiment (n = 50 per spacing).
Spacing (bp)          Mean LL    Std Dev   Δ vs. 17 bp
12                    −143.47    4.10      −1.20
13                    −142.71    4.56      −0.44
14 (HyenaDNA peak)    −141.79    4.87      +0.48
15                    −142.87    4.91      −0.60
16                    −142.66    4.61      −0.40
...
- [4] HyenaDNA peaks at 14 bp, not the biologically optimal 17 bp
- [5] The total range across all spacings is only 1.68 LL units (−143.47 to −141.79)
- [6] For comparison, the AT content effect spans 21.0 LL units, 12.5× larger
- [7] PA-PWM succeeds by construction
The model shows no preference for the biologically correct 17±1 bp range.
Workshop @ ICLR 2026
B.5 Full Strand Orientation Results
Table 12: Complete strand orientation experiment (n = 50 per condition) for HyenaDNA.
Condition                  Mean LL    Std Dev   Δ vs. Forward
Forward (correct)          −143.79    4.45      0.00
RC motifs in place         −142.83    3.99      +0.96
Full reverse complement    −142.13    3...
2026
- [8] ... are well-characterized biochemically. Our benchmark leverages this biological knowledge to create rigorous tests.
M Example Sequences
We provide representative sequences from each class to illustrate the benchmark design. Note: positions 0–57 shown; full sequences are 100 bp with random background extending to position 99.
M.1 Class C: Synthetic Intact
Pos: ...
2026
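The figures quoted in the spacing excerpts above can be cross-checked with simple arithmetic. The sketch below uses only the five rows of Table 11 that survive in the excerpt (the table is truncated, so the remaining rows are not reconstructed) and recovers the implied 17 bp reference mean, the 1.68 LL range, and the peak spacing; the variable names are illustrative.

```python
# Rows visible in the Table 11 excerpt: spacing (bp) -> mean LL and delta vs. 17 bp.
spacing_mean_ll = {12: -143.47, 13: -142.71, 14: -141.79, 15: -142.87, 16: -142.66}
delta_vs_17 = {12: -1.20, 13: -0.44, 14: 0.48, 15: -0.60, 16: -0.40}

# Each row implies a 17 bp reference mean of (mean LL - delta).
implied_17bp = {
    bp: round(spacing_mean_ll[bp] - delta_vs_17[bp], 2) for bp in spacing_mean_ll
}

# Range between the best and worst listed spacing, and the best spacing itself.
ll_range = round(max(spacing_mean_ll.values()) - min(spacing_mean_ll.values()), 2)
peak_spacing = max(spacing_mean_ll, key=spacing_mean_ll.get)

print(implied_17bp)
print(ll_range, peak_spacing)
```

The implied reference means agree to within rounding, which suggests the visible rows are internally consistent with each other and with the stated 1.68 LL range and 14 bp peak.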