arxiv: 2604.05775 · v1 · submitted 2026-04-07 · 💻 cs.CL · q-bio.GN

PhageBench: Can LLMs Understand Raw Bacteriophage Genomes?

Yusen Hou , Weicai Long , Haitao Hu , Houcheng Su , Junning Feng , Yanlin Zhang This is my paper

Pith reviewed 2026-05-10 19:16 UTC · model grok-4.3

classification 💻 cs.CL q-bio.GN

keywords bacteriophage genomesLLM evaluationraw sequence understandingbioinformatics benchmarkphage contig identificationhost predictiongenomic reasoning

0 comments

The pith

LLMs outperform random guessing on phage contig identification and host prediction but fail at complex long-range reasoning on raw DNA sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PhageBench, a benchmark of 5600 samples that tests whether large language models can interpret raw bacteriophage nucleotide sequences the way bioinformatics experts do. It structures the evaluation around three workflow stages: screening for phage contigs, quality control checks, and phenotype annotation such as host prediction. General-purpose reasoning models beat random baselines on the simpler identification and prediction tasks, indicating some capacity to extract biological signals directly from sequence data. The same models show clear weaknesses on tasks that require tracking dependencies across long stretches or pinpointing fine functional elements. These results point to both early promise and specific gaps in how current LLMs handle genomic sequences.

Core claim

PhageBench shows that general-purpose reasoning LLMs significantly outperform random baselines on phage contig identification and host prediction from raw sequences, yet they remain limited in tasks that involve long-range dependencies and fine-grained functional localization across the five core bioinformatics tasks.

What carries the argument

PhageBench benchmark consisting of 5600 high-quality samples across five tasks in screening, quality control, and phenotype annotation stages that mirror expert phage genome analysis workflows.

If this is right

Reasoning-focused LLMs could assist with initial phage sequence screening and host-range estimation without specialized fine-tuning.
Current models still cannot reliably localize functional elements or handle extended sequence context, limiting their use for detailed annotation.
PhageBench supplies a reusable test suite that can track future improvements in sequence-level biological reasoning.
Gaps revealed by the benchmark indicate the need for architectural changes that better capture long-distance nucleotide relationships.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended to other viral families to test whether the observed pattern of partial success on simple tasks holds more broadly.
If reasoning models advance on these tasks, they might speed up identification of phages for therapeutic or ecological applications.
Failures on long-range tasks suggest that purely text-based training may miss structural or statistical patterns unique to raw nucleotide data.

Load-bearing premise

The 5600 samples are representative and the five tasks faithfully capture the actual decision steps bioinformatics experts use when examining bacteriophage genomes.

What would settle it

An experiment in which domain experts judge the benchmark tasks as irrelevant to real phage analysis or in which the tested LLMs perform at or below random levels on every task.

Figures

Figures reproduced from arXiv: 2604.05775 by Haitao Hu, Houcheng Su, Junning Feng, Weicai Long, Yanlin Zhang, Yusen Hou.

**Figure 1.** Figure 1: PhageBench challenges general-purpose LLMs to perform five phage analysis tasks, processing raw nucleotide sequence inputs to generate reasoningbased phenotypic predictions. complex syntax and semantics of DNA. These models can predict gene expression (Hou et al., 2024; Avsec et al., 2025) and even design whole genome of functional bacteriophages (phages) (King et al., 2025). However, these specialized m… view at source ↗

**Figure 2.** Figure 2: The overview of data construction of PhageBench. We filter raw sequences from public repositories, apply [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: The compositional structure of PhageBench. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Statistical characteristics of the genomic sequences within the PhageBench dataset. The left panel displays [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Performance heatmaps across sequence length bins for Tasks 1–4. Each cell represents the accuracy of a [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: A case study of hallucinated reasoning in [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Detailed class and subclass distributions across the task 1–4. The pie charts demonstrate the strict [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗

**Figure 8.** Figure 8: Detailed class distributions of task 5. The pie charts demonstrate the strict balancing strategy applied to [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

**Figure 9.** Figure 9: Distribution of correct answer keys across all [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: The zero shot with CoT prompt. sponses to prevent context window exhaustion and required all outputs to be structured in a machineparseable JSON format. This design facilitates automated quantitative evaluation and ensures that performance metrics reflect genuine reasoning ability rather than parsing errors. To quantify the impact of explicit reasoning strategies on genomic understanding, we implemente… view at source ↗

**Figure 11.** Figure 11: The zero shot prompt. hibit moderate performance. This reflects the difficulty of associating distal features like terminal repeats across long sequence contexts. Among the evaluated architectures, GPT-5.2 demonstrates superior performance in this category, suggesting a relatively stronger capacity for maintaining global structural coherence compared to other reasoning models. In the lifestyle classific… view at source ↗

**Figure 12.** Figure 12: Detailed performance analysis of LLMs on PhageBench tasks. (Top) Phage Identification results [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Case study of successful Phage Contig Identification. [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Case study of failure in Contamination Detection. [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗

**Figure 15.** Figure 15: Case study of structural hallucination in Completeness Estimation. [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

**Figure 16.** Figure 16: Case study of failure in Lifestyle Classification. [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗

**Figure 17.** Figure 17: Case study of successful Host Prediction. [PITH_FULL_IMAGE:figures/full_fig_p023_17.png] view at source ↗

read the original abstract

Bacteriophages, often referred to as the dark matter of the biosphere, play a critical role in regulating microbial ecosystems and in antibiotic alternatives. Thus, accurate interpretation of their genomes holds significant scientific and practical value. While general-purpose Large Language Models (LLMs) excel at understanding biological texts, their ability to directly interpret raw nucleotide sequences and perform biological reasoning remains underexplored. To address this, we introduce PhageBench, the first benchmark designed to evaluate phage genome understanding by mirroring the workflow of bioinformatics experts. The dataset contains 5,600 high-quality samples covering five core tasks across three stages: Screening, Quality Control, and Phenotype Annotation. Our evaluation of eight LLMs reveals that general-purpose reasoning models significantly outperform random baselines in phage contig identification and host prediction, demonstrating promising potential for genomic understanding. However, they exhibit significant limitations in complex reasoning tasks involving long-range dependencies and fine-grained functional localization. These findings highlight the necessity of developing next-generation models with enhanced reasoning capabilities for biological sequences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PhageBench offers the first dedicated benchmark for LLMs on raw bacteriophage sequences with tasks that track real expert steps, but the abstract gives almost no numbers or controls so the claimed gains are hard to trust.

read the letter

PhageBench is new in framing a benchmark specifically around raw nucleotide strings from phages rather than protein sequences or natural-language descriptions. The 5600-sample set spans five tasks across screening, quality control, and phenotype annotation, which lines up with how people actually analyze these genomes. That structure is the main contribution and it is useful on its own for anyone who wants a starting point to test models on this domain. The abstract also correctly notes that current LLMs still fall short on the harder tasks that require long-range dependencies, which keeps the claims from overreaching. What the paper does well is simply exist as a concrete test suite where none was before. The evaluation of eight models and the observation that reasoning-oriented ones beat random on contig identification and host prediction is at least a baseline observation. The soft spots are more serious. No quantitative scores, confidence intervals, or dataset-construction details appear in the abstract, and the full text is not yet available to check them. The stress-test point lands: without ablations that rule out GC content, dinucleotide frequencies, or other compositional signals that LLMs could have seen in training, it is unclear whether the reported wins reflect any genuine modeling of phage biology. Simple k-mer or composition baselines would have been the obvious first control and they are not mentioned. This paper is mainly for people already working on LLMs for microbial genomics or phage therapy who need an off-the-shelf evaluation set. A reader who wants to run their own models on phage sequences could get immediate value from the task definitions. It is not yet ready for broad claims about LLM genomic understanding. I would send it to peer review because a new benchmark in this narrow but practical area is worth referee time to add the missing controls, numbers, and data-release details, even if the current version stays preliminary.

Referee Report

3 major / 2 minor

Summary. The paper introduces PhageBench, a benchmark of 5,600 high-quality samples covering five core tasks across Screening, Quality Control, and Phenotype Annotation stages to evaluate LLMs on raw bacteriophage genome understanding. Evaluation of eight LLMs finds that general-purpose reasoning models significantly outperform random baselines on phage contig identification and host prediction but show limitations on complex tasks involving long-range dependencies and fine-grained functional localization.

Significance. If the tasks are well-designed and results robust, the benchmark could help identify gaps in LLMs' ability to reason over raw nucleotide sequences rather than biological text, motivating specialized genomic models with applications in phage therapy and microbial ecology. The staged task design mirroring expert workflows is a strength if validated.

major comments (3)

[Methods] Methods (dataset construction): The claim that the 5,600 samples are high-quality and mirror bioinformatics expert workflows lacks sufficient detail on curation criteria, quality filters, and how tasks were operationalized, making it difficult to assess whether performance reflects biological understanding.
[Results] Results (evaluation of contig ID and host prediction): The reported outperformance over random baselines is not accompanied by comparisons to simple feature-based controls (e.g., logistic regression on GC content, dinucleotide frequencies, or k-mer profiles) or composition-matched negative controls; without these, it remains unclear whether gains arise from genuine long-range sequence modeling or from statistical patterns already present in training data.
[Discussion] Discussion: The paper notes weaknesses on complex tasks but provides no ablation studies isolating the role of compositional cues versus biological reasoning, which is load-bearing for the central claim that LLMs demonstrate 'promising potential for genomic understanding.'

minor comments (2)

[Abstract] Abstract: The statement that models 'significantly outperform random baselines' should include at least summary quantitative metrics (accuracy, F1, etc.) and brief mention of controls to allow readers to gauge effect sizes without reading the full text.
[Results] Figures/Tables: Ensure all result tables report error bars, number of runs, and exact random baseline values for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and describe the revisions we will make to improve clarity, robustness, and support for our claims.

read point-by-point responses

Referee: [Methods] Methods (dataset construction): The claim that the 5,600 samples are high-quality and mirror bioinformatics expert workflows lacks sufficient detail on curation criteria, quality filters, and how tasks were operationalized, making it difficult to assess whether performance reflects biological understanding.

Authors: We agree that the Methods section would benefit from greater detail to allow readers to fully evaluate the benchmark construction. In the revised manuscript, we will expand this section with explicit curation criteria (including sequence quality thresholds, completeness metrics, and exclusion rules for ambiguous or low-quality contigs), the specific quality filters applied during sample selection, and a step-by-step description of how each of the five tasks was operationalized to mirror standard expert bioinformatics pipelines. We will also add supplementary material with example processing scripts and decision trees used in dataset creation. revision: yes
Referee: [Results] Results (evaluation of contig ID and host prediction): The reported outperformance over random baselines is not accompanied by comparisons to simple feature-based controls (e.g., logistic regression on GC content, dinucleotide frequencies, or k-mer profiles) or composition-matched negative controls; without these, it remains unclear whether gains arise from genuine long-range sequence modeling or from statistical patterns already present in training data.

Authors: This observation is fair and highlights a way to strengthen the interpretation of our results. Our initial design used random baselines to establish that LLMs extract non-random signal from raw nucleotide sequences, which was the core question for a new benchmark. To directly address the concern about compositional cues, we will add in the revision a set of simple baseline comparisons using logistic regression and random forest classifiers trained on GC content, dinucleotide frequencies, and k-mer profiles for the contig identification and host prediction tasks. These will be reported alongside the LLM results with appropriate statistical tests. revision: yes
Referee: [Discussion] Discussion: The paper notes weaknesses on complex tasks but provides no ablation studies isolating the role of compositional cues versus biological reasoning, which is load-bearing for the central claim that LLMs demonstrate 'promising potential for genomic understanding.'

Authors: We acknowledge that explicit ablations would help substantiate the distinction between statistical pattern matching and deeper biological reasoning. The current manuscript reports performance gaps on long-range and fine-grained tasks but does not include targeted ablations. In the revised version, we will add ablation experiments (e.g., sequence shuffling while preserving composition, regional masking, and comparison of local vs. global context windows) for the complex tasks. These results will be used to refine the discussion and temper the interpretation of 'promising potential' accordingly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with no derivations or self-referential predictions

full rationale

The paper presents PhageBench as a new dataset of 5,600 samples across five tasks and reports direct empirical results from evaluating eight LLMs against random baselines. No equations, first-principles derivations, fitted parameters, or predictions appear in the provided text. Central claims rest on observed performance differences rather than any reduction to inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the benchmark design or results. This is a standard empirical evaluation paper whose load-bearing steps are external measurements, not internal tautologies.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or theoretical claims; the paper rests on standard practices for creating LLM benchmarks and bioinformatics task definitions.

pith-pipeline@v0.9.0 · 5493 in / 1159 out tokens · 51112 ms · 2026-05-10T19:16:07.021918+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 1 internal anchor

[1]

bioRxiv, pages 2025–06

Alphagenome: advancing regulatory variant effect prediction with a unified dna sequence model. bioRxiv, pages 2025–06. Dennis A Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi, James Ostell, Kim D Pruitt, and Eric W Sayers. 2018. Genbank.Nucleic acids re- search, 46(D1):D41–D47. Garyk Brixi, Matthew G Durrant, Jerome Ku, Michael Poli, Greg Broc...

work page 2025
[2]

Phage diversity, genomics and phylogeny.Na- ture Reviews Microbiology, 18(3):125–138. Google. 2025. Gemini 3 flash: Best for frontier intelli- gence at speed. Accessed: 2025-12-17. Ann C Gregory, Olivier Zablocki, Ahmed A Zayed, Allison Howell, Benjamin Bolduc, and Matthew B Sullivan. 2020. The gut virome database reveals age- dependent patterns of virome...

work page 2025
[3]

Cheng Peng, Jiayu Shang, Jiaojiao Guan, Donglin Wang, and Yanni Sun

Viralqc: A tool for assessing completeness and contamination of predicted viral contigs.arXiv preprint arXiv:2504.05790. Cheng Peng, Jiayu Shang, Jiaojiao Guan, Donglin Wang, and Yanni Sun. 2024. Viralm: empowering virus dis- covery through the genome foundation model.Bioin- formatics, 40(12):btae704. Akansha Prasad, Shadman Khan, Fatima Arshad, Hareet Si...

work page arXiv 2024
[4]

Qwen3 Technical Report

Qwen3 technical report.arXiv preprint arXiv:2505.09388. Zhihan Zhou, Yanrong Ji, Weijian Li, Pratik Dutta, Ra- mana Davuluri, and Han Liu. 2023. Dnabert-2: Ef- ficient foundation model and benchmark for multi- species genome.arXiv preprint arXiv:2306.15006. A Broader Impact This work extends beyond specific phage genome annotation tasks, holding significa...

work page internal anchor Pith review Pith/arXiv arXiv 2023