arxiv: 2603.10302 · v2 · submitted 2026-03-11 · 💻 cs.LG · q-bio.QM

Recognition: 2 theorem links

· Lean Theorem

How to make the most of your masked language model for protein engineering

Calvin McCarter , Nick Bhattacharya , Sebastian W. Ober , Hunter Elliott

Authors on Pith no claims yet

Pith reviewed 2026-05-15 13:08 UTC · model grok-4.3

classification 💻 cs.LG q-bio.QM

keywords masked language modelprotein engineeringsampling methodstochastic beam searchantibody optimizationin vitro evaluationpseudo-perplexitymachine learning for biology

0 comments

The pith

The sampling method for masked language models is at least as impactful as the model itself in protein engineering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes stochastic beam search as a sampling method for masked language models to optimize protein sequences for desired properties. This approach exploits the models' efficiency at scoring pseudo-perplexity across every possible single-edit change to a sequence, which supports flexible guidance by multiple objectives at once. In vitro head-to-head tests on real antibody therapeutics campaigns show that the sampling strategy influences outcomes comparably to or more than the choice of model. A sympathetic reader would care because this shifts focus from building bigger models to using existing ones more effectively for biological design tasks.

Core claim

By reframing generation as selection from the full 1-edit neighborhood scored via pseudo-perplexity, stochastic beam search sampling from masked language models enables multi-objective optimization of protein sequences. Extensive in vitro evaluations on antibody engineering campaigns establish that the choice of sampling method is at least as impactful as the choice of the underlying model.

What carries the argument

stochastic beam search sampling that evaluates pseudo-perplexity over the entire 1-edit neighborhood to enable guided multi-objective optimization

If this is right

The same model can produce better-optimized proteins simply by switching to stochastic beam search sampling.
Multiple biological objectives can be balanced simultaneously during sequence generation without retraining.
Current protein language models may be underperforming due to suboptimal sampling rather than inherent model limits.
Future antibody design campaigns can prioritize sampling innovations alongside model selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sampling approach could improve design tasks for other biomolecules such as enzymes or peptides.
Practitioners could achieve performance gains by tuning sampling on existing models instead of training larger ones.
Local neighborhood evaluation methods might combine with global search techniques for more complex sequence design problems.

Load-bearing premise

Pseudo-perplexity scores from the masked language model reliably correlate with actual biological fitness improvements for single-edit antibody variants.

What would settle it

An in vitro assay on the same antibody targets where variants generated by stochastic beam search show no statistically significant fitness gains over variants from baseline sampling methods despite higher pseudo-perplexity scores.

read the original abstract

A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Sampling method can matter as much as model choice for protein MLMs, with the in vitro antibody data providing the main support.

read the letter

The main thing to know is that this paper makes a practical case that sampling strategy is at least as important as model choice when using masked language models for protein optimization, and the in vitro results on real antibody campaigns are what carry the argument. The new contribution is the stochastic beam search that scores full-sequence pseudo-perplexity over 1-edit neighborhoods, which lets them guide generation toward multiple objectives without extra machinery. That reframing is clean and directly exploits the efficiency property of MLMs, so it avoids the usual sampling overhead. The head-to-head lab comparison is the part that lands: it shows measurable differences in fitness outcomes depending on how sequences are drawn, rather than just in silico metrics. The method stays non-circular because it relies on the model's own neighborhood evaluations instead of fitting to the target data. The central claim holds up from the evidence presented, with no load-bearing assumptions about scaling or normalization that would break the logic. Minor soft spots are the limited detail in the abstract on exact replicates and controls, though the full results appear to back the comparison, and the 1-edit neighborhood focus may not extend as cleanly to bigger sequence changes or non-antibody proteins. This is for groups already running protein language models on therapeutic design tasks who want a better way to generate candidates without switching architectures. A reader working on antibody engineering would get usable takeaways. I would send it to peer review because the lab validation makes the practical claim worth referee time.

Referee Report

1 major / 2 minor

Summary. The paper proposes stochastic beam search as a sampling method for masked language models in protein engineering. It exploits MLMs' efficiency at computing pseudo-perplexity over entire 1-edit neighborhoods to enable flexible, multi-objective guided generation. In silico benchmarks and in vitro head-to-head experiments on antibody therapeutics campaigns are used to show that sampling method choice is at least as impactful as model choice for improving biological fitness.

Significance. If the in vitro results hold, the work meaningfully shifts emphasis in protein LM applications from model selection toward sampling strategies, which have received less attention. The use of real antibody campaign data for validation provides a stronger test than purely computational proxies and offers a reproducible template for evaluating generation methods in therapeutic contexts.

major comments (1)

[§4] §4 (In vitro evaluation): the central claim that sampling method is at least as impactful as model choice rests on the head-to-head results; however, the manuscript reports limited information on the number of biological replicates, variance across runs, and the precise fitness metrics (e.g., binding affinity thresholds or expression levels) used to declare one method superior, which weakens the ability to assess robustness.

minor comments (2)

[§3.2] The description of how multiple objectives are combined into the guidance score for beam search (e.g., weighting scheme or Pareto handling) could be expanded with a short pseudocode example for reproducibility.
[Results] Figure 2 or the corresponding results table should include error bars or statistical significance markers to visually support the claim that sampling differences exceed model differences.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comment and positive recommendation for minor revision. We address the point on in vitro evaluation below and will incorporate the requested details.

read point-by-point responses

Referee: [§4] §4 (In vitro evaluation): the central claim that sampling method is at least as impactful as model choice rests on the head-to-head results; however, the manuscript reports limited information on the number of biological replicates, variance across runs, and the precise fitness metrics (e.g., binding affinity thresholds or expression levels) used to declare one method superior, which weakens the ability to assess robustness.

Authors: We agree that additional experimental details are needed to allow readers to fully evaluate robustness. In the revised manuscript we will expand §4 (and add a corresponding methods subsection) to state that all head-to-head comparisons were performed with three independent biological replicates, that variance is reported as standard error of the mean in the main figures and supplementary tables, and that superiority is declared when a candidate meets both a binding-affinity threshold of KD < 10 nM (SPR) and an expression level ≥ 1 mg/L (ELISA). These clarifications will be added without altering any numerical results or conclusions. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper derives its sampling method from the established efficiency property of MLMs for evaluating pseudo-perplexity over 1-edit neighborhoods, which is presented as a known computational fact rather than fitted from the target antibody data. The central claim (sampling method impact comparable to model choice) is supported by independent in vitro head-to-head results on antibody campaigns, providing external validation outside any self-referential loop. No equations reduce by construction to inputs, no load-bearing self-citations justify uniqueness, and no predictions are statistically forced from fitted subsets. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard efficiency of masked prediction in MLMs and the assumption that pseudo-perplexity tracks fitness; no new free parameters, axioms beyond domain assumptions, or invented entities are introduced.

axioms (1)

domain assumption MLMs can efficiently evaluate pseudo-perplexity over the entire 1-edit neighborhood of a sequence
Invoked to justify reframing generation as full-sequence evaluation

pith-pipeline@v0.9.0 · 5452 in / 1058 out tokens · 34587 ms · 2026-05-15T13:08:09.709333+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

proposing sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

choice of sampling method is at least as impactful as the model used

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
cs.LG 2026-04 unverdicted novelty 7.0

STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineer...