ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins

Amir Karimi; Jonathan Golob; Patrick Schwab; Stefan Bauer; Yichen Zhou

arxiv: 2603.06740 · v2 · pith:D5X4KXDMnew · submitted 2026-03-06 · 🧬 q-bio.QM · cs.AI

Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab

Pith reviewed 2026-05-21 12:32 UTC · model grok-4.3

keywords: protein language models, deep mutational scanning, viral proteins, SARS-CoV-2, mutation prediction, benchmarks, influenza neutralisation, fitness landscapes

The pith

Deep mutational scanning and neutralisation assays identify protein language models that generalise to forecasting real SARS-CoV-2 mutations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ViroGym, a benchmark that tests protein language models across dozens of lab experiments on viral proteins and a real-world task of predicting mutations that appeared during the SARS-CoV-2 pandemic. It evaluates established models on fitness effects from 79 deep mutational scanning datasets, antibody neutralisation of influenza variants, and actual pandemic emergence data. The central finding is that models performing strongly on the two types of laboratory measurements also succeed at identifying mutations that spread in nature, even when the specific mutations each method highlights show little overlap. This indicates that the lab assays together capture the main evolutionary rules that allow certain viral changes to succeed outside the test tube. If the result holds, it offers a practical way to rank models for anticipating future viral evolution using existing experimental data rather than waiting for new outbreaks.

Core claim

The ProGen2 family of protein language models achieves the strongest performance across fitness landscapes, antigenic diversity, and pandemic forecasting in the ViroGym benchmark. DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real-world mutation forecasting.

What carries the argument

The ViroGym benchmark integrates 79 deep mutational scanning assays (552,065 mutated sequences across seven phenotypic readouts), 21 influenza neutralisation tasks, and a SARS-CoV-2 real-world pandemic prediction task, and uses them to rank protein language models.
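As a rough illustration of how a multi-assay benchmark can reduce to one number per model, the sketch below averages a per-assay rank correlation between model scores and measured phenotypes. The choice of Spearman correlation is an assumption borrowed from common practice in DMS benchmarking; the text here does not state which metric ViroGym uses. The names `spearman`, `benchmark_score`, and the toy assay values are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def benchmark_score(assays):
    """Mean per-assay Spearman correlation (assumed aggregation, not the paper's)."""
    return mean(spearman(a["model_scores"], a["measured"]) for a in assays)


# Toy data: made-up numbers, not taken from any ViroGym assay.
toy_assays = [
    {"model_scores": [0.1, 0.4, 0.3, 0.9], "measured": [1.0, 2.0, 3.0, 4.0]},
    {"model_scores": [-2.0, -1.0, -3.0], "measured": [0.5, 0.9, 0.1]},
]
print(benchmark_score(toy_assays))
```

Averaging per assay rather than pooling all sequences keeps a single large assay from dominating the score, which matters when assays differ widely in size.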

If this is right

Models selected by strong DMS and neutralisation scores can be deployed for proactive forecasting of viral variant emergence.
Combining multiple in vitro assay types supplies a fuller set of evolutionary constraints than any single assay category alone.
Existing laboratory datasets become a reliable filter for choosing protein language models suited to virological prediction tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same benchmark structure could be applied to other viruses of pandemic concern to test whether the same selection pattern holds.
High-performing models identified this way might be used to prioritise candidate mutations for early experimental validation in surveillance programs.
The minimal overlap between mutation sets surfaced by different tasks suggests that future benchmarks could deliberately diversify assay types to cover more constraints.

Load-bearing premise

That the 79 DMS assays and 21 influenza neutralisation tasks together provide a sufficiently representative sample of the evolutionary constraints that actually determine which mutations succeed in real pandemics such as SARS-CoV-2.

What would settle it

A protein language model that ranks highest on the DMS and neutralisation tasks within ViroGym but fails to accurately rank the mutations that actually rose to high frequency in circulating SARS-CoV-2 populations would falsify the generalisation claim.
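One way to make this test concrete is to measure how many of the mutations that actually emerged appear among a model's highest-scored candidates. The sketch below uses top-k recall for that purpose; the metric, the function name `top_k_recall`, and the placeholder mutation labels are all assumptions for illustration and are not taken from the paper.

```python
def top_k_recall(model_scores, emerged, k):
    """Fraction of emerged mutations found among the model's k highest-scored mutations."""
    ranked = sorted(model_scores, key=model_scores.get, reverse=True)
    top_k = set(ranked[:k])
    return len(top_k & emerged) / len(emerged)


# Placeholder mutation labels and scores (hypothetical, not real data).
emerged = {"m1", "m3"}
model_a = {"m1": 0.9, "m2": 0.1, "m3": 0.7, "m4": 0.3}
model_b = {"m1": 0.2, "m2": 0.8, "m3": 0.1, "m4": 0.9}

for name, scores in [("model_a", model_a), ("model_b", model_b)]:
    print(name, top_k_recall(scores, emerged, k=2))
```

Under this framing, the generalisation claim would be in trouble if the model that leads on the in vitro tasks showed recall on emerged mutations no better than weaker models.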

Original abstract

Protein language models (pLMs) have shown strong potential for zero-shot prediction of missense variant effects, yet systematic benchmarking on viral proteins remains limited, a critical gap given the need for proactive tools that can anticipate emerging mutations ahead of experimental validation. Here we introduce ViroGym, a comprehensive benchmark evaluating pLMs across three tasks: 79 deep mutational scanning (DMS) assays covering eukaryotic viruses with 552,065 mutated sequences across 7 phenotypic readouts, 21 influenza neutralisation tasks, and a real-world pandemic prediction task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting, and find that the ProGen2 family consistently achieves the strongest performance across all three tasks. Crucially, DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real-world mutation forecasting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces ViroGym, a benchmark for protein language models consisting of 79 DMS assays (552,065 mutated sequences across 7 phenotypic readouts for eukaryotic viruses), 21 influenza neutralisation tasks, and a SARS-CoV-2 real-world emergence prediction task. It reports that the ProGen2 family achieves the strongest performance across all tasks and claims that DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets surfaced by the benchmarks barely overlap.

Significance. If substantiated, the work supplies a large-scale, multi-task resource that could accelerate development of pLMs for viral evolution forecasting and pandemic preparedness. The scale of the DMS collection and the reported link between in-vitro performance and real-world generalisation would be useful contributions to computational virology.

major comments (1)

Abstract: the claim that DMS and neutralisation performance 'reliably identifies models that generalise to real-world emergence' is load-bearing for the paper's central interpretation. No ablation is described that matches models on parameter count, training data volume, or pre-training objective; without such controls it remains possible that ProGen2's lead simply reflects general sequence-modeling capacity rather than capture of specific evolutionary constraints, especially given the reported minimal overlap in surfaced mutations.
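The confound raised here can be probed without retraining any model, for instance with a partial correlation between in vitro and real-world performance that controls for model size. The sketch below is one such check under stated assumptions: it uses Pearson correlations, treats log parameter count as the only covariate, and runs on made-up numbers. None of these choices comes from the paper.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


# Hypothetical per-model values, one entry per model (not real results).
in_vitro = [0.30, 0.35, 0.42, 0.48, 0.50]    # e.g. mean DMS score
real_world = [0.10, 0.14, 0.13, 0.20, 0.22]  # e.g. emergence-task score
log_params = [8.0, 8.5, 9.0, 9.5, 9.8]       # log10 of parameter count

print("raw:", pearson(in_vitro, real_world))
print("controlling for size:", partial_corr(in_vitro, real_world, log_params))
```

A raw correlation that collapses once size is controlled for would support the referee's reading; one that survives would favour the authors'. With only a handful of models, either outcome carries wide uncertainty.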

minor comments (1)

The abstract states performance rankings and a generalisation result but provides no details on statistical controls, exact baseline implementations, or the method used to quantify mutation-set overlap; these should be added to the methods and results sections.
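For reference, one simple way to quantify mutation-set overlap is the Jaccard index between the top-k mutations that two tasks surface. The sketch below illustrates that option only; the paper's actual overlap method is not described here, and the helper names, the value of k, and the scores are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scored mutation labels as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder labels and scores (hypothetical, not real data).
dms_scores = {"m1": 0.9, "m2": 0.8, "m3": 0.2, "m4": 0.1}
neut_scores = {"m1": 0.7, "m2": 0.1, "m3": 0.9, "m4": 0.2}

overlap = jaccard(top_k(dms_scores, 2), top_k(neut_scores, 2))
print(overlap)
```

Reporting the overlap at several values of k would show whether "barely overlap" holds across thresholds or only at one cutoff.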

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the major comment point by point below, focusing on the interpretation of our results and the request for additional controls.

Referee: Abstract: the claim that DMS and neutralisation performance 'reliably identifies models that generalise to real-world emergence' is load-bearing for the paper's central interpretation. No ablation is described that matches models on parameter count, training data volume, or pre-training objective; without such controls it remains possible that ProGen2's lead simply reflects general sequence-modeling capacity rather than capture of specific evolutionary constraints, especially given the reported minimal overlap in surfaced mutations.

Authors: We thank the referee for this important observation. Our manuscript benchmarks a diverse collection of established pLMs that differ in scale, architecture, and training data, and we report that the ProGen2 family shows the strongest performance across the DMS, neutralisation, and real-world emergence tasks. The minimal overlap between the mutation sets surfaced by DMS and neutralisation assays is presented in the paper as evidence that these benchmarks are complementary rather than redundant, which we argue supports the broader claim that strong in-vitro performance can help identify models with better real-world generalisation. At the same time, we acknowledge that the manuscript does not contain explicit ablations that hold parameter count, training data volume, or pre-training objective fixed. In the revised version we will add a dedicated paragraph in the Discussion that summarises the characteristics (size and data volume) of the evaluated models and will revise the abstract wording from 'reliably identifies' to 'provides evidence that' to reflect the correlational nature of the current results. We will also include supplementary correlation plots between model scale and task performance where such metadata are publicly available.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper evaluates pLMs on external pre-existing DMS assays, influenza neutralisation tasks, and independent real-world SARS-CoV-2 emergence data as ground truth. No equations, fitted parameters, or self-citations are shown to reduce the reported generalisation result (DMS/neutralisation identifying real-world performance) to a quantity defined by the authors' own choices or inputs. The central claim rests on correlation across independent benchmarks rather than any definitional or self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new free parameters, axioms, or invented entities. It relies on existing protein language models and previously published DMS and neutralisation datasets as inputs.

