pith · machine review for the scientific record

arxiv: 2604.05774 · v1 · submitted 2026-04-07 · 🧬 q-bio.GN · cs.CL

Recognition: no theorem link

GenomeQA: Benchmarking General Large Language Models for Genome Sequence Understanding

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3

classification 🧬 q-bio.GN cs.CL
keywords genome sequence understanding · large language models · benchmark · DNA sequences · genomics · LLM evaluation · sequence inference

The pith

General LLMs outperform random baselines on raw genome sequences by detecting local signals such as GC content and short motifs, but accuracy falls when tasks require multi-step inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates GenomeQA to test general-purpose large language models on direct DNA sequence inputs rather than text descriptions or specialized training. It assembles 5,200 examples drawn from biological databases and organizes them into six task families that range from identifying enhancers and promoters to predicting transcription factor binding. Evaluation across six frontier models shows consistent gains above chance levels on tasks that hinge on immediate sequence features, while results weaken on problems that demand combining patterns across longer distances or multiple logical steps. This setup matters because conversational use of LLMs in genomics is growing, and the benchmark isolates how well models handle sequence data itself before external annotations are supplied.
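To make "local signals" concrete, the sketch below computes GC content and exact short-motif hits directly from a raw DNA string, the kind of surface feature the review says models appear to exploit. The sequence and the TATA-box-like motif are illustrative choices, not examples drawn from the benchmark.

```python
# Minimal sketch (not from the paper) of two local signals computable from a
# raw DNA string alone: GC content and exact occurrences of a short motif.

def gc_content(seq: str) -> float:
    """Fraction of G/C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def motif_positions(seq: str, motif: str) -> list:
    """0-based start positions of exact occurrences of a short motif."""
    seq, motif = seq.upper(), motif.upper()
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq[i:i + len(motif)] == motif]

if __name__ == "__main__":
    seq = "ATGCGCGTATAAAAGGCCGCGCTATAAAGC"  # toy sequence
    print(f"GC content: {gc_content(seq):.2f}")
    print("TATAAA (TATA-box-like) hits at:", motif_positions(seq, "TATAAA"))
```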

Core claim

GenomeQA demonstrates that frontier LLMs can exploit local sequence properties such as GC content and short motifs to exceed random performance on genome inference tasks, yet their success declines sharply when the required reasoning involves indirect or multi-step integration of sequence patterns across the six evaluated task families.

What carries the argument

The GenomeQA benchmark itself, a collection of 5,200 controlled sequence samples spanning six task families that directly exposes general LLMs to raw DNA strings for inference without external biological text.

If this is right

  • Models achieve above-chance results by detecting immediate sequence features such as GC content and short motifs.
  • Performance drops on tasks that require combining information across multiple steps or indirect patterns.
  • The benchmark isolates sequence-only reasoning from knowledge retrieval, providing a diagnostic tool for model improvement.
  • Current frontier LLMs remain limited on genome tasks that go beyond local pattern matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future model training could add explicit objectives that reward integration of distant sequence signals to close the performance gap on harder tasks.
  • The same benchmark format could be extended to longer sequences or additional genome annotations to track progress as context windows grow.
  • Developers of genomics chat interfaces might route simple local-pattern queries to LLMs while reserving multi-step problems for specialized tools.

Load-bearing premise

The six chosen task families and 5,200 samples drawn from existing databases form a representative proxy for how general LLMs would encounter real genome sequences in practice.

What would settle it

An experiment in which the same six models are retested after all local motifs and GC content are deliberately randomized in the input sequences; if performance remains above the random baseline, the claim that models rely on those local signals would be contradicted.
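A minimal sketch of one way to build that control, assuming the simplest form of randomization: replacing each input with a uniform-random sequence of the same length, which destroys native motifs and native GC composition alike. The example sequence is illustrative, not taken from GenomeQA.

```python
# Sketch of a randomized control input: same length as the original sequence,
# but uniform-random bases, so neither GC content nor local motifs survive.
import random

def randomize_sequence(seq: str, seed=None) -> str:
    """Uniform-random DNA string of the same length as `seq`."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(len(seq)))

if __name__ == "__main__":
    original = "ATGCGCGTATAAAAGGCCGCGCTATAAAGC"
    control = randomize_sequence(original, seed=0)
    print(original)
    print(control)  # same length, no preserved local structure
```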

Figures

Figures reproduced from arXiv: 2604.05774 by Donglin Xie, Houcheng Su, Junning Feng, Shuo Yang, Weicai Long, Yanlin Zhang, Yusen Hou.

Figure 1. Overview of GenomeQA. The pipeline consists of Data collection, processing, construction and LLM …
Figure 2. Distribution of Options in BCQ (a) and MCQ …
Figure 3. Performance comparison of LLMs across six …
Figure 4. Distribution of failed cases. Failures occur when models rely on general sequence elements while neglecting specific details. For instance, in Histone Mark Prediction, the model incorrectly classifies an open Alu repeat as closed (Su et al., 2014). It simply applies the general rule that transposable elements are repressed and overlooks the high GC content of this specific element. Base Composition Over-…
Figure 5. Label statistics for all tasks. The charts display the distribution of question types for BCQ and MCQ.
Figure 6. Motifs in TFBS Prediction. In this task, while genome sequences naturally contain multiple binding motifs (multi-label), each question interrogates the presence of a specific target Transcription Factor selected from a set of 20. The details are provided in …
Figure 7. The Base Prompt used as a baseline. It contains only basic task descriptions and formatting instructions …
Figure 8. The Optimized Prompt designed for the LLM. It includes explicit role definition, domain constraints, a …
Figure 9. Case Study. Examples of failure modes in GenomeQA along with input details and model responses.
read the original abstract

Large Language Models (LLMs) are increasingly adopted as conversational assistants in genomics, where they are mainly used to reason over biological knowledge, annotations, and analysis outputs through natural language interfaces. However, existing benchmarks either focus on specialized DNA models trained for sequence prediction or evaluate biological knowledge using text-only questions, leaving the behavior of general-purpose LLMs when directly exposed to raw genome sequences underexplored. We introduce GenomeQA, a benchmark designed to provide a controlled evaluation setting for general-purpose LLMs on sequence-based genome inference tasks. GenomeQA comprises 5,200 samples drawn from multiple biological databases, with sequence lengths ranging from 6 to 1,000 base pairs (bp), spanning six task families: Enhancer and Promoter Identification, Splice Site Identification, Taxonomic Classification, Histone Mark Prediction, Transcription Factor Binding Site Prediction, and TF Motif Prediction. Across six frontier LLMs, we find that models consistently outperform random baselines and can exploit local sequence signals such as GC content and short motifs, while performance degrades on tasks that require more indirect or multi-step inference over sequence patterns. GenomeQA establishes a diagnostic benchmark for studying and improving the use of general-purpose LLMs on raw genomic sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces GenomeQA, a benchmark of 5,200 samples drawn from biological databases spanning six task families (Enhancer/Promoter Identification, Splice Site Identification, Taxonomic Classification, Histone Mark Prediction, TF Binding Site Prediction, and TF Motif Prediction) with sequence lengths from 6 to 1,000 bp. It evaluates six frontier general-purpose LLMs on raw sequence inputs, reporting that models outperform random baselines by exploiting local signals such as GC content and short motifs, while performance degrades on tasks requiring indirect or multi-step inference over sequence patterns.

Significance. If the evaluation protocol is shown to be robust and controlled, GenomeQA would offer a useful diagnostic tool for assessing general LLMs on direct genomic sequence reasoning, distinguishing local pattern recognition from higher-order inference. The empirical distinction between local-signal and multi-step tasks could inform both model development and the design of genomics-specific prompting or fine-tuning strategies.

major comments (3)
  1. [Benchmark construction] Benchmark construction (likely §3 or equivalent): the description of sample selection and task construction provides no details on stratification by sequence length, database source, or controls to isolate length effects, yet the central claim compares performance across tasks with lengths ranging 6–1,000 bp; without such controls the reported degradation on multi-step tasks could be confounded by length rather than inference depth.
  2. [Experimental setup] Experimental setup (likely §4): the paper does not specify the exact prompt templates, tokenization of raw sequences, or output parsing procedure used for the six LLMs; these choices are load-bearing for the claim that models “exploit local sequence signals” because different formatting can alter whether GC content or motifs are accessible in context.
  3. [Results and analysis] Results and analysis (likely §5): no statistical tests, confidence intervals, or multiple-comparison corrections are reported for the performance differences versus random baselines or across task families; given that the abstract highlights consistent outperformance on local-signal tasks, the absence of significance testing weakens the support for the strongest claim.
minor comments (2)
  1. [Abstract] The abstract states “5,200 samples” but does not indicate whether this is the final evaluated set or includes held-out data; clarify the split in the main text.
  2. [Throughout] Task family names are capitalized inconsistently between the abstract and later sections; standardize nomenclature.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for improving the clarity and rigor of the GenomeQA benchmark. We address each major comment below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (likely §3 or equivalent): the description of sample selection and task construction provides no details on stratification by sequence length, database source, or controls to isolate length effects, yet the central claim compares performance across tasks with lengths ranging 6–1,000 bp; without such controls the reported degradation on multi-step tasks could be confounded by length rather than inference depth.

    Authors: We acknowledge that the original manuscript provides insufficient detail on sample selection and stratification. In the revised version, we will expand §3 to include the exact sampling procedure from each database, any balancing across length bins (e.g., reporting mean and distribution of lengths per task family), and source-specific controls. To directly address potential confounding, we will add an analysis of model performance as a function of sequence length within each task family, demonstrating that the observed degradation on multi-step inference tasks persists even after controlling for length. This will clarify that the distinction between local-signal and multi-step tasks is not an artifact of length variation. revision: yes
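A minimal sketch of the kind of length-binned breakdown promised above, assuming a per-example results table with task family, sequence length, and correctness; the column names and bin edges are assumptions for illustration, not the paper's actual schema.

```python
# Sketch: accuracy per (task family, length bin), separating task effects
# from sequence-length effects. Bin edges are arbitrary illustrative choices.
import pandas as pd

def accuracy_by_length(records: pd.DataFrame,
                       bins=(0, 50, 200, 500, 1000)) -> pd.DataFrame:
    """records needs columns: task_family, seq_length, correct (0/1)."""
    records = records.copy()
    records["length_bin"] = pd.cut(records["seq_length"], bins=bins)
    return (records
            .groupby(["task_family", "length_bin"], observed=True)["correct"]
            .agg(accuracy="mean", n="size")
            .reset_index())

if __name__ == "__main__":
    # toy records, not results from the paper
    toy = pd.DataFrame({
        "task_family": ["Splice Site", "Splice Site", "TFBS", "TFBS"],
        "seq_length":  [40, 400, 120, 900],
        "correct":     [1, 0, 1, 0],
    })
    print(accuracy_by_length(toy))
```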

  2. Referee: [Experimental setup] Experimental setup (likely §4): the paper does not specify the exact prompt templates, tokenization of raw sequences, or output parsing procedure used for the six LLMs; these choices are load-bearing for the claim that models “exploit local sequence signals” because different formatting can alter whether GC content or motifs are accessible in context.

    Authors: We agree that these implementation details are essential for reproducibility and for supporting claims about local signal exploitation. The revised manuscript will include the complete prompt templates for each of the six task families, a description of how raw DNA sequences are tokenized and formatted (as plain strings using standard nucleotide characters), and the exact output parsing rules used to extract model answers (including handling of free-form responses). These additions will allow readers to verify how local patterns such as GC content remain accessible in the input context. revision: yes
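A minimal sketch of what such a disclosure could look like in code: a plain-string prompt template and a parser that reduces a free-form reply to an option letter. The wording and the regular expression are illustrative assumptions, not the paper's actual template or parsing rules.

```python
# Sketch of a plain-text prompt over a raw nucleotide string, plus a parser
# that extracts "Answer: X" from a free-form model response.
import re
from typing import Optional

PROMPT_TEMPLATE = (
    "Task: {task_description}\n"
    "DNA sequence: {sequence}\n"
    "Options: {options}\n"
    "Answer with a single option letter in the form 'Answer: X'."
)

ANSWER_RE = re.compile(r"Answer:\s*([A-D])\b", re.IGNORECASE)

def build_prompt(task_description: str, sequence: str, options: str) -> str:
    return PROMPT_TEMPLATE.format(task_description=task_description,
                                  sequence=sequence.upper(),
                                  options=options)

def parse_answer(response: str) -> Optional[str]:
    """Return the first option letter found, or None if unparseable."""
    m = ANSWER_RE.search(response)
    return m.group(1).upper() if m else None

if __name__ == "__main__":
    print(build_prompt("Is this an enhancer or a promoter?",
                       "atgcgcgtataaaaggcc", "A) enhancer  B) promoter"))
    print(parse_answer("The GC content suggests a promoter. Answer: B"))
```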

  3. Referee: [Results and analysis] Results and analysis (likely §5): no statistical tests, confidence intervals, or multiple-comparison corrections are reported for the performance differences versus random baselines or across task families; given that the abstract highlights consistent outperformance on local-signal tasks, the absence of significance testing weakens the support for the strongest claim.

    Authors: We accept this criticism. The current results section reports only point estimates of accuracy. In the revision, we will add binomial tests (or appropriate non-parametric equivalents) comparing each model’s performance against the random baseline for every task, report 95% confidence intervals on all accuracy figures, and apply multiple-comparison corrections (e.g., Bonferroni or FDR) when testing differences across the six task families. These statistical results will be presented in updated tables and figures to provide rigorous support for the reported outperformance on local-signal tasks. revision: yes
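A minimal sketch of the promised statistics using standard tools: a per-task binomial test against the chance level, a 95% confidence interval on accuracy, and Benjamini-Hochberg correction across tasks. The counts and chance levels below are toy assumptions about option counts, not results from the paper.

```python
# Sketch: per-task binomial test vs. random baseline, 95% CI on accuracy,
# and FDR correction across task-level tests. All numbers are toy values.
from scipy.stats import binomtest
from statsmodels.stats.multitest import multipletests

# (task, n_correct, n_total, chance_level)  -- illustrative only
results = [
    ("Enhancer/Promoter", 540, 1000, 0.5),
    ("Splice Site",       410, 1000, 0.5),
    ("TFBS Prediction",   270, 1000, 0.25),
]

pvals = []
for task, k, n, chance in results:
    test = binomtest(k, n, p=chance, alternative="greater")
    ci = test.proportion_ci(confidence_level=0.95)
    pvals.append(test.pvalue)
    print(f"{task}: acc={k/n:.3f} (95% CI {ci.low:.3f}-{ci.high:.3f}), "
          f"p={test.pvalue:.2e} vs chance={chance}")

# Benjamini-Hochberg FDR correction across the task-level tests
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("FDR-adjusted p-values:", [f"{p:.2e}" for p in p_adj])
```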

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical benchmark paper with no derivations, equations, fitted parameters, or load-bearing self-citations. GenomeQA is constructed from external biological databases (5,200 samples across six task families), and results are obtained by direct evaluation of six frontier LLMs against random baselines. Claims about exploiting local signals (GC content, motifs) versus degrading on multi-step inference follow immediately from the task design and observed performance; no step reduces to its inputs by construction. The benchmark is grounded in external data sources and evaluated against standard controls.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the selected tasks adequately capture genome sequence understanding and that samples from existing databases are unbiased for LLM evaluation.

axioms (1)
  • domain assumption The six task families represent core genome sequence inference problems suitable for LLM testing.
    Stated in the abstract without further justification of coverage or importance.

pith-pipeline@v0.9.0 · 5534 in / 1094 out tokens · 46477 ms · 2026-05-10T18:36:38.960271+00:00 · methodology

discussion (0)

