Transformer-Based Active Learning for Data-Efficient Vaccine Epitope Selection in PRRS

Aspen Erlandsson Brisebois; Brook Byrns; Connor Burbridge; Gordon Broderick; Heather L. Wilson; Steven Rayan; Sureesh Tikoo; Zahed Khatooni

arxiv: 2606.28659 · v1 · pith:H26FHFNKnew · submitted 2026-06-27 · 🧬 q-bio.BM · cs.LG

Pith reviewed 2026-06-30 09:01 UTC · model grok-4.3

classification 🧬 q-bio.BM cs.LG

keywords active learning · transformer · epitope selection · PRRS · vaccine design · molecular docking · machine learning · data-efficient classification

The pith

Transformer models with active learning classify PRRS epitopes accurately using half the docking data of standard baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests machine learning approaches, especially small transformer sequence models, inside an active learning loop to predict which 9-mer epitopes bind strongly to a conserved swine leukocyte antigen (SLA) receptor for PRRS vaccine design. The models are trained on an internally generated set of 80 epitope-SLA docking affinities, each requiring more than 48 hours of high-performance computing. Under moderate data availability (N=30), the optimized configuration beats a baseline trained on twice as many examples; at N=60 it reaches 86.8 percent accuracy, consistent with an estimated 85 percent upper limit set by conformational noise. This matters because docking is the main computational bottleneck in epitope screening, so any method that reduces the number of simulations required could speed up candidate selection for vaccines. The authors obtain these results by running large-scale hyperparameter searches over model families, training settings, acquisition policies, and ensemble rules, averaging across many randomized, balanced data splits to reduce the effect of subsample selection.

Core claim

Transformer-based sequence models trained with active incremental learning consistently outperform linear, MLP, and CNN alternatives and a random-acquisition baseline; at N=30 the best configuration exceeds the accuracy of a standard model trained on 60 examples, while at N=60 it attains 86.8 percent accuracy, consistent with an 85 percent upper bound derived from two independent estimates of conformational noise in the docking labels.

What carries the argument

Pool-based active learning loop that selects the most informative unlabeled 9-mer epitopes for expensive docking simulation, using hyperparameter optimization across model architecture, training configuration, acquisition policy, and ensemble decision rules.
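
A pool-based loop of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is a plain logistic regression rather than a transformer, the acquisition rule is simple uncertainty sampling (the point whose predicted probability is closest to 0.5), and `oracle` is a hypothetical stand-in for the expensive docking run. All function names are illustrative.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, steps=300, l2=1e-2):
    """Gradient-descent logistic regression (stand-in for the paper's models)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(model, X):
    w, b = model
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def active_learning_loop(X_pool, oracle, n_init=10, budget=30, seed=0):
    """Label n_init random pool points, then query one point per round until budget."""
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X_pool), size=n_init, replace=False)]
    labels = {i: oracle(i) for i in labeled}
    while len(labeled) < budget:
        y = np.array([labels[i] for i in labeled], dtype=float)
        model = fit_logistic(X_pool[labeled], y)
        unlabeled = [i for i in range(len(X_pool)) if i not in labels]
        p = predict_proba(model, X_pool[unlabeled])
        # Assumed acquisition rule: query the most uncertain unlabeled point.
        query = unlabeled[int(np.argmin(np.abs(p - 0.5)))]
        labels[query] = oracle(query)  # the expensive labeling step
        labeled.append(query)
    y = np.array([labels[i] for i in labeled], dtype=float)
    return fit_logistic(X_pool[labeled], y), labeled
```

Swapping the acquisition rule or the model only changes the two lines that fit the model and pick `query`, which is why the acquisition policy can be treated as one more hyperparameter in the search.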

If this is right

Active incremental learning yields significant gains over random sample acquisition.
Transformer architectures emerge as the strongest performer under strict low-data conditions.
At N=30 the optimized model exceeds the accuracy of a baseline trained on twice the data.
At N=60 the same model reaches 86.8 percent accuracy, consistent with the 85 percent conformational-noise upper bound.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same active-learning setup could be applied to epitope selection for other swine or human viruses where docking remains the rate-limiting step.
If the 9-mer representation misses longer-range sequence features, extending the input to full-length peptides would be a direct next test.
Replacing or augmenting the docking labels with experimental binding data would provide a clearer test of whether the reported accuracy gains survive outside the simulation.

Load-bearing premise

The 80 docking affinities are treated as reliable ground truth whose only uncertainty comes from conformational noise rather than systematic biases in the docking protocol itself.

What would settle it

Wet-lab binding measurements on the epitopes selected by the model show substantially lower agreement with the predicted high-affinity labels than the 85 percent noise-limited ceiling would allow.

Original abstract

High-fidelity molecular docking simulations can produce biologically relevant estimates of epitope-receptor binding affinity but are computationally expensive and therefore limit the number of candidates that can be screened for vaccine design. In this work, we evaluate machine learning (ML) approaches where variants of active learning are used to classify instances of high binding affinity between 9-mer epitopes and a well-conserved swine leukocyte antigen (SLA) receptor in the context of Porcine Reproductive and Respiratory Syndrome (PRRS). We use an internally generated dataset of 80 epitope-SLA docking affinities, each requiring more than 48 hours of high-performance computing (HPC). Multiple model families (linear, MLP, CNN, and a small transformer) are trained under strict low-data conditions within a pool-based active learning loop. In each case, optimal model configurations are identified by conducting large-scale hyperparameter optimization over the combined space of model architecture, training configuration, acquisition policy, and ensemble decision rules. To mitigate the effects of data subsample selection, each candidate configuration is evaluated by averaging performance over many randomized and balanced training and validation data subsets. Across experiments, transformer-based sequence models consistently emerged as the best-performing architecture, with active incremental learning yielding significant improvement over a baseline random sample acquisition strategy. Under moderate training data availability (N=30), the optimized ML-model configuration outperforms a standard baseline trained on twice the amount of data. Under higher training data availability (N=60), the same configuration achieves a peak accuracy of 86.8%, consistent with an upper bound of 85% classification accuracy based on two independent estimates of conformational noise.
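
The abstract says each configuration is scored by averaging over many randomized, balanced training and validation subsets, but it does not give the procedure. The Python sketch below shows one plausible way to do this, assuming binary 0/1 labels and equal class counts in each training subset. The names `balanced_split` and `mean_score` and the repeat count are illustrative, not taken from the paper.

```python
import numpy as np

def balanced_split(y, n_train, rng):
    """Draw n_train indices with equal class counts; the rest form the validation set."""
    per_class = n_train // 2
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    train = np.concatenate([
        rng.choice(pos, size=per_class, replace=False),
        rng.choice(neg, size=per_class, replace=False),
    ])
    val = np.setdiff1d(np.arange(len(y)), train)
    return train, val

def mean_score(score_fn, y, n_train, n_repeats=50, seed=0):
    """Average score_fn(train_idx, val_idx) over repeated balanced splits."""
    rng = np.random.default_rng(seed)
    scores = [score_fn(*balanced_split(y, n_train, rng)) for _ in range(n_repeats)]
    return float(np.mean(scores)), float(np.std(scores))
```

Reporting the standard deviation alongside the mean makes it visible when a difference between two configurations is smaller than the split-to-split variation.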

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Transformer-based active learning beats random acquisition on 80 docked PRRS epitopes, but the docking labels have no external check.

The core result is that a small transformer with active learning reaches 86.8% accuracy at N=60 on classifying high-affinity 9-mer epitopes against an SLA receptor, beating random sampling and even a baseline trained on twice the data at N=30. The numbers sit close to the 85% noise upper bound they derive from two conformational estimates.

The work does a few things cleanly. It runs a broad hyperparameter sweep over architectures, acquisition policies, and ensemble rules, then averages performance across many randomized balanced splits to reduce selection noise. That produces stable comparisons rather than single-run artifacts. The transformer consistently comes out on top under the low-data regime, which is the practical point for stretching expensive docking runs.

The load-bearing assumption is that the 80 internal docking affinities are unbiased enough to serve as ground truth. The abstract supplies noise estimates but no cross-validation against experimental affinities, known PRRS epitopes, or orthogonal assays. If the docking protocol carries systematic bias from force-field limits or 9-mer truncation, the reported data-efficiency gains stay tied to the proxy and may not translate to real epitope selection. The dataset is also small and single-source, so claims stay narrow.

This is a useful case study for people doing computational vaccinology on livestock pathogens who already run docking and want to screen more candidates under tight compute. It is not a new algorithm or a broad methodological advance.

It should go to peer review. The empirical setup is reproducible enough on the ML side and the numbers are concrete, even though the docking validation gap needs to be addressed in revision.

Referee Report

3 major / 1 minor

Summary. The manuscript applies transformer-based models within a pool-based active learning framework to classify 9-mer epitopes with high binding affinity to the SLA receptor for PRRS vaccine design. Using an internally generated dataset of 80 docking affinities, the authors report that optimized active learning configurations outperform random-sampling baselines: a model trained on N=30 actively acquired samples exceeds a baseline trained on N=60 samples, and at N=60 the same configuration reaches a peak accuracy of 86.8%, which is consistent with an estimated 85% upper bound derived from conformational noise.

Significance. If the docking-derived labels prove reliable, the work demonstrates a promising route to data-efficient epitope selection that could substantially reduce the computational burden of high-fidelity docking in vaccine design pipelines. The comprehensive hyperparameter search across architectures, acquisition functions, and ensemble rules, combined with repeated evaluation on randomized balanced subsets, provides a solid empirical foundation for the active learning gains. The consistency with the noise-derived accuracy ceiling adds credibility to the performance numbers.

major comments (3)

[Abstract] The performance claims (N=30 active learning outperforming N=60 random; 86.8% peak accuracy) rest on the assumption that the 80 internally generated docking affinities constitute unbiased ground-truth labels. No external validation against experimental binding affinities, literature values for known PRRS epitopes, or orthogonal assays is reported, leaving open the possibility that force-field inaccuracies, limited receptor flexibility, or 9-mer truncation introduce systematic biases that would invalidate the claimed data-efficiency gains for real vaccine epitope selection.
[Methods (implied from abstract description of hyperparameter search)] It is unclear whether the large-scale hyperparameter optimization over model architecture, training configuration, acquisition policy, and ensemble rules was conducted in a nested fashion inside the active-learning loop or performed on the full dataset, which could lead to optimistic bias in the reported improvements.
[Abstract] Details on how the 80 docking runs were generated, how positive/negative class balance was enforced in the low-data regimes (N=30 and N=60), and the precise procedure for the two independent conformational noise estimates that yield the 85% upper bound are not provided, making it difficult to assess the robustness of the noise ceiling and the generalizability of the results.

minor comments (1)

The abstract could benefit from a brief statement on the specific transformer architecture used (e.g., number of layers, attention heads) to allow readers to gauge model scale.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve clarity on methods and limitations while maintaining the core claims.

Referee: [Abstract] The performance claims (N=30 active learning outperforming N=60 random; 86.8% peak accuracy) rest on the assumption that the 80 internally generated docking affinities constitute unbiased ground-truth labels. No external validation against experimental binding affinities, literature values for known PRRS epitopes, or orthogonal assays is reported, leaving open the possibility that force-field inaccuracies, limited receptor flexibility, or 9-mer truncation introduce systematic biases that would invalidate the claimed data-efficiency gains for real vaccine epitope selection.

Authors: We agree this is a substantive limitation: the labels are docking-derived without external experimental corroboration. The manuscript is a computational study demonstrating active learning efficiency under these labels; we have added an explicit Discussion paragraph noting the absence of wet-lab validation, the potential for force-field or truncation biases, and that real-world gains would require such confirmation. The reported consistency with the independent 85% noise ceiling provides internal support but does not substitute for external validation. We cannot supply experimental data in this revision.
revision: partial

Referee: [Methods (implied from abstract description of hyperparameter search)] It is unclear whether the large-scale hyperparameter optimization over model architecture, training configuration, acquisition policy, and ensemble rules was conducted in a nested fashion inside the active-learning loop or performed on the full dataset, which could lead to optimistic bias in the reported improvements.

Authors: The hyperparameter search averaged performance across many randomized balanced train/validation subsets drawn from the full 80-sample set to identify stable configurations; each subsequent active-learning run then used fresh, independent splits. We have revised the Methods to state explicitly that the search occurred outside the active-learning loops but employed nested cross-validation on the full set to avoid leakage or optimistic bias in the final AL performance numbers.
revision: yes

Referee: [Abstract] Details on how the 80 docking runs were generated, how positive/negative class balance was enforced in the low-data regimes (N=30 and N=60), and the precise procedure for the two independent conformational noise estimates that yield the 85% upper bound are not provided, making it difficult to assess the robustness of the noise ceiling and the generalizability of the results.

Authors: We have expanded the Methods section with: (i) the full docking protocol (software, force field, receptor preparation, 9-mer truncation details, and run parameters), (ii) the exact subsampling procedure that enforced balanced positive/negative classes at N=30 and N=60, and (iii) the two independent noise-estimation methods (replicate docking variance and conformational ensemble sampling) that produced the 85% ceiling. These additions enable reproducibility and direct evaluation of the noise bound.
revision: yes
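
The second exchange turns on what "nested" means for the hyperparameter search. As a generic illustration only, and not the procedure the authors describe, the Python sketch below shows nested cross-validation: the inner loop picks a configuration using outer-training data alone, and the outer test fold is touched once, for scoring. The ridge classifier, the candidate values, and all function names are assumptions made for the example.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Split a random permutation of range(n) into k disjoint folds."""
    return np.array_split(rng.permutation(n), k)

def fit_ridge(X, y, l2):
    """Least-squares classifier on +/-1 targets (illustrative model)."""
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ (2 * y - 1))

def accuracy(w, X, y):
    return float(np.mean((X @ w > 0) == (y == 1)))

def nested_cv(X, y, candidates, fit, score, k_outer=5, k_inner=4, seed=0):
    rng = np.random.default_rng(seed)
    folds = kfold_indices(len(y), k_outer, rng)
    outer_scores = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        inner = kfold_indices(len(train), k_inner, rng)

        def inner_score(c):
            # Score candidate c using outer-training data only.
            vals = []
            for a, val_pos in enumerate(inner):
                tr_pos = np.concatenate([f for b, f in enumerate(inner) if b != a])
                m = fit(X[train[tr_pos]], y[train[tr_pos]], c)
                vals.append(score(m, X[train[val_pos]], y[train[val_pos]]))
            return np.mean(vals)

        best = max(candidates, key=inner_score)
        model = fit(X[train], y[train], best)
        outer_scores.append(score(model, X[test], y[test]))  # test fold used once
    return float(np.mean(outer_scores))
```

A search run on the full dataset and then re-evaluated on splits of that same dataset lacks the separation shown here, which is the source of the optimistic bias the referee raises.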

standing simulated objections not resolved

Absence of external experimental validation of the docking-derived labels, which cannot be supplied without new wet-lab experiments outside the scope of the current computational manuscript.

Circularity Check

0 steps flagged

No circularity: empirical ML results on internal docking dataset

The paper reports performance metrics (e.g., 86.8% peak accuracy at N=60, outperformance at N=30) obtained by training and evaluating transformer and other models on an internally generated set of 80 docking affinities within an active learning loop. These figures arise from standard hyperparameter optimization, cross-validation over randomized subsets, and comparison to random acquisition baselines; they do not reduce by construction to any fitted parameter, self-citation, or ansatz. The 85% upper bound is stated as coming from two independent noise estimates rather than from the model outputs themselves. No load-bearing self-citation, uniqueness theorem, or renaming of known results is present in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the assumption that the 80 docking affinities constitute a representative and low-noise training distribution for the SLA receptor. No free parameters are explicitly fitted beyond standard ML hyperparameters; no new physical entities are introduced.

axioms (1)

domain assumption: Docking simulation outputs can be treated as ground-truth labels for binding affinity classification after accounting for stated conformational noise.
Invoked when the 85% upper bound is used to contextualize the 86.8% accuracy.

Reference graph

Works this paper leans on

13 extracted references · 5 canonical work pages · 5 internal anchors

[1]

Introduction Computational vaccine design faces an unfavorable trade-off between biological fidelity and throughput. While molecular docking and related simulation approaches can produce estimates of protein-protein binding affinity that are directly relevant to prioritizing epitope candidates, they are dauntingly slow and resource intensive to apply at ...

2025
[2]

𝐴 where the subscript 𝜓 denotes the range-specific parameters (min, max, threshold). 𝒩

Methods The overall proposed framework is composed of multiple components, each with features that can be tailored to a specific problem or environment (Figure 1). 2.1. Dataset Preparation and Preprocessing We used an internally generated dataset of 80 epitope-SLA docking results with each row containing (i) a 9-mer epitope amino-acid sequence, (ii) a sca...

1979
[3]

2.2. Problem Formulation and Learning Targets All models in this work are evaluated based on predicted accuracy in a binary classification task consisting of assigning an epitope from a held-out test set to either a Strong or Weak affinity class. While training is conducted using the continuous affinity estimates provided by the docking experiments, the ...

2022
[4]

The model family choice is defined as: 𝑓 ∈ {Linear (Lin), MLP, CNN, Transformer (Tr)}, with the following providing a summary description of each model family:
• Linear: A single linear mapping from the flattened one-hot encoded sequence to a scalar output, with no hidden layers or nonlinear transformations.
• MLP: A feedforward neural network composed of one ...

1986
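
The Linear family above takes a flattened one-hot encoding of the 9-mer as input. A minimal Python sketch of that representation follows, assuming the 20 standard amino acids in alphabetical one-letter order; the ordering, the function names, and the example peptide are illustrative and not taken from the paper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues (assumed ordering)
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_9mer(seq):
    """Encode a 9-residue peptide as a flattened 9 x 20 one-hot vector (length 180)."""
    if len(seq) != 9:
        raise ValueError("expected a 9-mer")
    x = np.zeros((9, 20), dtype=np.float32)
    for pos, aa in enumerate(seq):
        x[pos, AA_INDEX[aa]] = 1.0
    return x.reshape(-1)

def linear_score(x, w, b=0.0):
    """Single linear map from the encoding to a scalar, with no hidden layers."""
    return float(x @ w + b)

# Made-up peptide, used only to show the shapes involved.
x = one_hot_9mer("ACDEFGHIK")
w = np.zeros(180, dtype=np.float32)
score = linear_score(x, w)
```

With this encoding the linear model has one weight per (position, residue) pair, so it can express position-specific residue preferences but no interactions between positions.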
[5]

including the choice of search algorithm for model weight identification ℴ (e.g., Adam), the learning rate 𝐿$ (log-scaled), learning momentum M (in SGD and RMSProp), weight decay 𝜆%&, batch size B, and the choice of loss function ℒ conditional on the training target 𝒯. 2.6. Active Learning Framework Each numerical experiment presented here requires ...

2017
[6]

The final prediction is given by $\hat{y}(x) = \mathbb{I}\left[\sum_{m=1}^{M} \hat{y}_m(x) \ge M/2\right]$

• Majority vote: Each ensemble member produces a binary prediction $\hat{y}_m(x) = \mathbb{I}[p_m(x) \ge 0.5]$. The final prediction is given by $\hat{y}(x) = \mathbb{I}\left[\sum_{m=1}^{M} \hat{y}_m(x) \ge M/2\right]$.
• Mean round: The ensemble mean predicted probability is $\bar{p}(x) = \frac{1}{M}\sum_{m=1}^{M} p_m(x)$. A threshold of 0.5 is then applied to $\bar{p}(x)$ to produce the final predicted class assignment: $\hat{y}(x) = \mathbb{I}[\bar{p}(x) \ge 0.5]$.
• Confide...

2023
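
The two fully quoted ensemble rules, majority vote and mean round, can be compared directly in Python. This sketch assumes each of the M ensemble members outputs a probability of the Strong class, and that a tie in the majority vote counts as Strong (a vote total of at least M/2); the tie rule is read from the garbled excerpt and may not match the paper exactly.

```python
import numpy as np

def majority_vote(probs):
    """probs has shape (M, n_samples); each member votes with a 0.5 threshold."""
    votes = (probs >= 0.5).astype(int)
    n_members = probs.shape[0]
    return (votes.sum(axis=0) >= n_members / 2).astype(int)

def mean_round(probs):
    """Average the member probabilities, then apply a single 0.5 threshold."""
    return (probs.mean(axis=0) >= 0.5).astype(int)

# Three members, three samples (made-up numbers).
probs = np.array([
    [0.90, 0.40, 0.60],
    [0.60, 0.45, 0.10],
    [0.20, 0.90, 0.20],
])
print(majority_vote(probs))  # [1 0 0]
print(mean_round(probs))     # [1 1 0]
```

The second sample shows how the rules can disagree: two members lean weakly negative while one is strongly positive, so the vote says Weak but the averaged probability says Strong.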
[7]

and perceived as novel or informative, effectively forgoing any opportunity for incremental learning. Comprehensively optimizing over a variety of model architectures and acquisition strategies, our results suggest that Transformer-based models, supported by ensemble decision rules that also leverage Expected Gradient Length (EGL) for active-learning sam...

2024
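
Expected Gradient Length scores an unlabeled point by how much its label, once known, would be expected to change the model. The excerpt does not say which EGL variant the paper uses, so the Python sketch below shows only the textbook classification form for a binary logistic model with log loss: the gradient for label y is (p - y)·[x, 1], and its norm is averaged over y drawn from the model's own prediction. The function name and the example data are illustrative.

```python
import numpy as np

def egl_scores(X, w, b=0.0):
    """Expected gradient length for binary logistic regression (weights and bias).

    For label y the log-loss gradient is (p - y) * [x, 1], so its norm is
    |p - y| * ||[x, 1]||. Taking the expectation over y ~ Bernoulli(p) gives
    p * (1 - p) * norm + (1 - p) * p * norm = 2 * p * (1 - p) * norm.
    """
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    feat_norm = np.sqrt((X ** 2).sum(axis=1) + 1.0)  # the +1 accounts for the bias term
    return 2.0 * p * (1.0 - p) * feat_norm

# Made-up pool of two points; the acquisition step would query the argmax.
X = np.array([[0.0, 0.0], [3.0, 4.0]])
w = np.array([1.0, 0.0])
scores = egl_scores(X, w)
query = int(np.argmax(scores))
```

Unlike plain uncertainty sampling, the score also grows with the feature norm, so a confidently predicted point can still rank highly when its input is large.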
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

References [1]. Guo D, Yang D, Zhang H, Song J, Wang P, Zhu Q, Xu R, Zhang R, Ma S, Bi X, Zhang X. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948. 2025 Jan

2025
[9]

(example reference) [2]. Otsu N. A threshold selection method from gray-level histograms. Automatica. 1975;11:285-96. [3]. Dubey SR, Singh SK, Chaudhuri BB. Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing. 2022 Sep 7;503:92-108. [4]. Jadon A, Patil A, Jadon S. A comprehensive survey of regression-based loss func...

1975
[10]

Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance

[9]. Shukla M. Bayesian Uncertainty and Expected Gradient Length-Regression: Two Sides Of The Same Coin?. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision 2022 (pp. 2367-2376). [10]. Watanabe S. Tree-structured parzen estimator: Understanding its algorithm components and their roles for better empirical performance. arXi...

2022
[11]

Yang Z, Zeng X, Zhao Y, Chen R

[11]. Yang Z, Zeng X, Zhao Y, Chen R. AlphaFold2 and its applications in the fields of biology and medicine. Signal Transduction and Targeted Therapy. 2023 Mar 14;8(1):115. [12]. Malviya B, Alisher Y, Sharma M, Hussein L, Yaswitha G. BERT-based Models for Predicting Protein-Protein Interaction Sites. In2024 IEEE International Conference on Communication, ...

2023
[12]

Dataset Augmentation in Feature Space

[15]. Szymborski J, Emad A. A flaw in using pretrained protein language models in protein–protein interaction inference models. Nature Machine Intelligence. 2026 Feb 13:1-2. [16]. de Vries S, Thierens D. Learning with confidence: training better classifiers from soft labels. Machine Learning. 2025 Nov;114(11):238. [17]. DeVries T, Taylor GW. Dataset augme...

2026
[13]

On the Dimensionality of Embeddings for Sparse Features and Data

[18]. Naumov M. On the dimensionality of embeddings for sparse features and data. arXiv preprint arXiv:1901.02103. 2019 Jan

2019
