ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins
Pith reviewed 2026-05-21 12:32 UTC · model grok-4.3
The pith
Deep mutational scanning and neutralisation assays identify protein language models that generalise to forecasting real SARS-CoV-2 mutations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The ProGen2 family of protein language models achieves the strongest performance across fitness landscapes, antigenic diversity, and pandemic forecasting in the ViroGym benchmark. DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real-world mutation forecasting.
What carries the argument
The ViroGym benchmark, which ranks protein language models on three task types: 79 deep mutational scanning assays covering 552,065 mutated sequences across seven phenotypic readouts, 21 influenza neutralisation tasks, and a SARS-CoV-2 real-world pandemic prediction task.
If this is right
- Models selected by strong DMS and neutralisation scores can be deployed for proactive forecasting of viral variant emergence.
- Combining multiple in vitro assay types supplies a fuller set of evolutionary constraints than any single assay category alone.
- Existing laboratory datasets become a reliable filter for choosing protein language models suited to virological prediction tasks.
Where Pith is reading between the lines
- The same benchmark structure could be applied to other viruses of pandemic concern to test whether the same selection pattern holds.
- High-performing models identified this way might be used to prioritise candidate mutations for early experimental validation in surveillance programs.
- The minimal overlap between mutation sets surfaced by different tasks suggests that future benchmarks could deliberately diversify assay types to cover more constraints.
Load-bearing premise
That the 79 DMS assays and 21 influenza neutralisation tasks together provide a sufficiently representative sample of the evolutionary constraints that actually determine which mutations succeed in real pandemics such as SARS-CoV-2.
What would settle it
A protein language model that ranks highest on the DMS and neutralisation tasks within ViroGym but fails to accurately rank the mutations that actually rose to high frequency in circulating SARS-CoV-2 populations would falsify the generalisation claim.
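One way to make this test concrete is to compare, across models, the ranking by in vitro score with the ranking by real-world forecasting score. The sketch below is a minimal illustration, not the paper's evaluation code: the function names, the model labels, and all scores are hypothetical placeholders, and Spearman's rank correlation is an assumed choice of agreement measure.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def forecast_rank_of_best_in_vitro(in_vitro, forecast):
    """Rank (1 = best) on the forecasting task of the top in vitro model."""
    best = max(in_vitro, key=in_vitro.get)
    ordered = sorted(forecast, key=forecast.get, reverse=True)
    return ordered.index(best) + 1


# Hypothetical placeholder scores (higher is better); not values from the paper.
in_vitro = {"model_a": 0.40, "model_b": 0.35, "model_c": 0.20, "model_d": 0.10}
forecast = {"model_a": 0.15, "model_b": 0.30, "model_c": 0.25, "model_d": 0.05}

models = sorted(in_vitro)
rho = spearman([in_vitro[m] for m in models], [forecast[m] for m in models])
best_rank = forecast_rank_of_best_in_vitro(in_vitro, forecast)
print(f"rank agreement across models: {rho:.2f}")
print(f"forecast rank of best in vitro model: {best_rank}")
```

A low rank agreement, or a top in vitro model that lands far down the forecasting ranking, is the pattern that would count against the generalisation claim. What threshold counts as failure is a judgment the sketch leaves open.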
Original abstract
Protein language models (pLMs) have shown strong potential for zero-shot prediction of missense variant effects, yet systematic benchmarking on viral proteins remains limited, a critical gap given the need for proactive tools that can anticipate emerging mutations ahead of experimental validation. Here we introduce ViroGym, a comprehensive benchmark evaluating pLMs across three tasks: 79 deep mutational scanning (DMS) assays covering eukaryotic viruses with 552,065 mutated sequences across 7 phenotypic readouts, 21 influenza neutralisation tasks, and a real-world pandemic prediction task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting, and find that the ProGen2 family consistently achieves the strongest performance across all three tasks. Crucially, DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real-world mutation forecasting.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ViroGym, a benchmark for protein language models consisting of 79 DMS assays (552,065 mutated sequences across 7 phenotypic readouts for eukaryotic viruses), 21 influenza neutralisation tasks, and a SARS-CoV-2 real-world emergence prediction task. It reports that the ProGen2 family achieves the strongest performance across all tasks and claims that DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets surfaced by the benchmarks barely overlap.
Significance. If substantiated, the work supplies a large-scale, multi-task resource that could accelerate development of pLMs for viral evolution forecasting and pandemic preparedness. The scale of the DMS collection and the reported link between in-vitro performance and real-world generalisation would be useful contributions to computational virology.
major comments (1)
- Abstract: the claim that DMS and neutralisation performance 'reliably identifies models that generalise to real-world emergence' is load-bearing for the paper's central interpretation. No ablation is described that matches models on parameter count, training data volume, or pre-training objective; without such controls it remains possible that ProGen2's lead simply reflects general sequence-modeling capacity rather than capture of specific evolutionary constraints, especially given the reported minimal overlap in surfaced mutations.
minor comments (1)
- The abstract states performance rankings and a generalisation result but provides no details on statistical controls, exact baseline implementations, or the method used to quantify mutation-set overlap; these should be added to the methods and results sections.
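For the mutation-set overlap in particular, one common way to quantify it is the Jaccard index between the top-k mutations surfaced by each task. The sketch below assumes that measure; the abstract does not say which method the authors used. The function names, mutation labels, and scores are hypothetical placeholders.

```python
def top_k(scores, k):
    """Return the k highest-scoring mutations as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical placeholder mutation labels and scores; not data from the paper.
dms_scores = {"A10V": 0.9, "K20R": 0.8, "S30T": 0.7, "L40F": 0.1}
neut_scores = {"A10V": 0.2, "K20R": 0.9, "S30T": 0.1, "L40F": 0.8}

overlap = jaccard(top_k(dms_scores, 2), top_k(neut_scores, 2))
print(f"top-2 overlap: {overlap:.2f}")
```

Reporting the index at several values of k would show whether "barely overlap" holds only for the very top of each ranking or more broadly.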
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address the major comment point by point below, focusing on the interpretation of our results and the request for additional controls.
- Referee: Abstract: the claim that DMS and neutralisation performance 'reliably identifies models that generalise to real-world emergence' is load-bearing for the paper's central interpretation. No ablation is described that matches models on parameter count, training data volume, or pre-training objective; without such controls it remains possible that ProGen2's lead simply reflects general sequence-modeling capacity rather than capture of specific evolutionary constraints, especially given the reported minimal overlap in surfaced mutations.
  Authors: We thank the referee for this important observation. Our manuscript benchmarks a diverse collection of established pLMs that differ in scale, architecture, and training data, and we report that the ProGen2 family shows the strongest performance across the DMS, neutralisation, and real-world emergence tasks. The minimal overlap between the mutation sets surfaced by DMS and neutralisation assays is presented in the paper as evidence that these benchmarks are complementary rather than redundant, which we argue supports the broader claim that strong in-vitro performance can help identify models with better real-world generalisation. At the same time, we acknowledge that the manuscript does not contain explicit ablations that hold parameter count, training data volume, or pre-training objective fixed. In the revised version we will add a dedicated paragraph in the Discussion that summarises the characteristics (size and data volume) of the evaluated models and will revise the abstract wording from 'reliably identifies' to 'provides evidence that' to reflect the correlational nature of the current results. We will also include supplementary correlation plots between model scale and task performance where such metadata are publicly available.
  Revision: partial
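Short of a matched ablation, a partial correlation that controls for model scale gives a rough check on the referee's concern. The sketch below is an assumed analysis, not one described in the manuscript: it asks whether in vitro and forecasting scores still correlate across models once log parameter count is accounted for. All names and numbers are hypothetical placeholders.

```python
from math import log10, sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z.

    Assumes neither x nor y is perfectly correlated with z.
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical placeholder values, one entry per model; not data from the paper.
param_counts = [1e8, 3e8, 1e9, 3e9, 1e10]
in_vitro = [0.10, 0.25, 0.20, 0.35, 0.30]
forecast = [0.05, 0.10, 0.20, 0.15, 0.30]

log_scale = [log10(p) for p in param_counts]
raw = pearson(in_vitro, forecast)
controlled = partial_corr(in_vitro, forecast, log_scale)
print(f"raw correlation: {raw:.2f}, controlling for scale: {controlled:.2f}")
```

If the correlation largely vanishes after controlling for scale, the in vitro benchmarks may mostly be tracking general model capacity. With only a handful of models, either outcome would be weak evidence.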
Circularity Check
No significant circularity detected
The paper evaluates pLMs on external, pre-existing DMS assays and influenza neutralisation tasks, and uses independent real-world SARS-CoV-2 emergence data as ground truth. No equations, fitted parameters, or self-citations are shown to reduce the reported generalisation result (that DMS and neutralisation performance identifies models with strong real-world performance) to a quantity defined by the authors' own choices or inputs. The central claim rests on correlation across independent benchmarks rather than on any definitional or self-referential reduction.