pith. sign in

arxiv: 2605.16581 · v1 · pith:XKSXN5PAnew · submitted 2026-05-15 · 💻 cs.LG

Structure-Aware Masking for Protein Representation Learning

Pith reviewed 2026-05-20 19:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords protein language modelsmasked language modelingstructure-aware maskingbucket maskingprotein fitness predictionlong-range interactionsmutational effectsinductive bias
0
0 comments X

The pith

Protein language models learn better when masking targets residues that are close together in 3D structure instead of choosing them at random.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bucket Masking to replace the usual random masking used when pretraining protein language models. Rather than hiding single residues independently, it identifies groups of residues that sit near one another in the folded protein and masks whole groups at once. This change makes the training objective focus on the long-range couplings that actually determine how a protein works. On four separate tasks that measure how well models predict the effects of mutations, the new masking method raises performance by as much as 14 percent, with the largest gains appearing when several mutations interact. Controlled tests show the benefit comes from the structural placement of the masks rather than from simply masking longer stretches.

Core claim

Bucket Masking is a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for protein function. Across four downstream protein fitness prediction tasks, Bucket Masking enables up to a 14% improvement over standard random masking, excelling at predicting higher-order mutational interactions. Through controlled ablations, these improvements arise from mask placement rather than span size, establishing

What carries the argument

Bucket Masking: a pretraining procedure that partitions sequence positions into buckets according to 3D residue contacts and then masks entire buckets together.

If this is right

  • Models trained this way become more accurate at forecasting the combined effects of several mutations at once.
  • The placement of masks, not merely their total count or length, supplies a useful positional bias for learning nonlocal dependencies.
  • Downstream fitness predictors improve most on tasks that involve higher-order mutational interactions.
  • The same masking change can be applied to any sequence model that is later evaluated on structure-dependent protein properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If contact maps from predicted structures work nearly as well as experimental ones, the method could be used at scale without needing new laboratory data for every protein.
  • The same idea of grouping positions by geometry might transfer to other sequence domains where spatial or functional proximity matters, such as RNA or small-molecule binding sites.
  • Future pretraining could combine this masking bias with explicit geometric losses to further strengthen the link between sequence representations and 3D structure.

Load-bearing premise

The 3D structural contacts used to form the buckets are both available and capture the functional couplings that determine fitness.

What would settle it

Retraining the same model with buckets formed by randomly grouping residues of the same sizes instead of using actual 3D contacts, then checking whether the 14 percent gain on the fitness tasks disappears.

Figures

Figures reproduced from arXiv: 2605.16581 by Amirali Aghazadeh, Ayan Goel, Thomas Walton.

Figure 1
Figure 1. Figure 1: Overview of random masking and Bucket Masking. a, Multiple sequence alignments (MSAs) provide training data for protein language models (PLMs), implicitly encoding evolutionary constraints. b, Random masking places mask tokens uniformly at random, independent of structural constraints such as long-range contacts (CAPSD_AAV2S, residues 174-230 and 496-542, PDB: 1LP3). c, Representations encoded by random ma… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of data. Seventeen proteins with varying molecular functions are used in this study. A detailed table characterizing each protein is available in Appendix B.2. 4.1 Data Seventeen proteins with varying molecular functions and lengths are assessed in this study, as detailed in [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Bucket Masking versus random masking. Across 17 proteins, Bucket Masking and random masking are assessed on four tasks, which probe different qualities of learned representations. Each plot details the delta in Spearman ρ for each task. Bucket Masking outperforms random masking on all tasks, and most significantly on regime and position extrapolation. Neighborhood extrapolation. Neighborhood extrapolation … view at source ↗
Figure 4
Figure 4. Figure 4: Quantifying the importance of position in masking. a, A Spearman ρ delta plot of Bucket Masking compared to geometry-matched span masking (GM span), an ablation that determines mask span lengths using Bucket Masking but randomly shuffles the placement of the span. Bucket Masking outperforms GM span on all extrapolation tasks, isolating the importance of the position of the mask from the size of the span. b… view at source ↗
Figure 5
Figure 5. Figure 5: Results across all extrapolation tasks for all proteins and methods [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Bucket Masking versus GM span performance on long-range tasks. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
read the original abstract

Masked language modeling (MLM) is the standard objective for training protein language models, typically implemented by randomly masking individual residues at a fixed rate (e.g., 15%). This practice implicitly assumes that all sequence positions contribute equally to representation learning. In downstream fitness prediction tasks, however, protein sequences are governed by three-dimensional structural dependencies and long-range residue contacts that induce strong nonlocal couplings between residues. We introduce Bucket Masking, a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for protein function. Across four downstream protein fitness prediction tasks, Bucket Masking enables up to a 14% improvement over standard random masking, excelling at predicting higher-order mutational interactions. Through controlled ablations, we show that these improvements arise from mask placement rather than span size, establishing masking as a positional inductive bias.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Bucket Masking, a structure-aware variant of masked language modeling for protein sequences. Rather than masking individual residues uniformly at random, residues are grouped into buckets according to 3D spatial proximity (via contact maps) and entire buckets are masked together. The authors claim this induces a useful inductive bias for long-range interactions, yielding up to 14% gains over random masking on four downstream protein fitness prediction tasks and particularly improving prediction of higher-order mutational effects. Controlled ablations are presented to attribute the gains to mask placement rather than span length.

Significance. If the central results hold after clarification of contact-map provenance and statistical reporting, the work would demonstrate that a modest change to the pretraining masking distribution can measurably improve modeling of epistatic couplings without requiring structural inputs at inference. The explicit separation of placement from span size in the ablations is a methodological strength that helps isolate the claimed positional bias.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Bucket Masking definition): the claim that masking is conditioned on 'residue contacts' that capture 'functional couplings' is load-bearing for the 14% improvement and the higher-order mutation advantage, yet the manuscript provides no description of contact-map source (experimental PDB entries, AlphaFold predictions, or otherwise), distance threshold, or bucket-construction procedure. Without this, it is impossible to determine whether the reported gains reflect a general property of structure-aware masking or an artifact of the particular contact data used for the pretraining corpus and the four downstream tasks.
  2. [§4] §4 (downstream evaluation and ablations): the quantitative results (up to 14% improvement, advantage on higher-order mutations) are presented without error bars, number of random seeds, or statistical significance tests. Because the central claim rests on these gains being robust and attributable to mask placement, the absence of variance estimates leaves open the possibility that the differences are within run-to-run variability of the fitness predictors.
minor comments (2)
  1. [Methods] The abstract states that improvements 'arise from mask placement rather than span size,' but the precise definition of 'span size' (contiguous residues vs. bucket diameter) and how it is controlled in the ablation should be stated explicitly in the methods for reproducibility.
  2. [Results] Table or figure captions for the four fitness tasks should list the exact datasets, number of variants, and whether contacts were available at test time.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive overall assessment of the work. We have revised the manuscript to address the two major comments by adding the requested details on contact-map construction and by including statistical reporting for the experimental results. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Bucket Masking definition): the claim that masking is conditioned on 'residue contacts' that capture 'functional couplings' is load-bearing for the 14% improvement and the higher-order mutation advantage, yet the manuscript provides no description of contact-map source (experimental PDB entries, AlphaFold predictions, or otherwise), distance threshold, or bucket-construction procedure. Without this, it is impossible to determine whether the reported gains reflect a general property of structure-aware masking or an artifact of the particular contact data used for the pretraining corpus and the four downstream tasks.

    Authors: We agree that the provenance and construction details are necessary for reproducibility and to support the central claims. The revised Section 3 now specifies that contact maps for the pretraining corpus are derived from experimental PDB entries (with AlphaFold models used only for sequences lacking PDB structures), using an 8 Å Cα distance threshold to define contacts. Buckets are formed by computing the connected components of the contact graph and discarding components smaller than a minimum size threshold; pseudocode for this procedure has been added to the appendix. These details establish that the masking strategy relies on standard structural biology definitions of spatial proximity, which prior literature has linked to functional couplings, rather than an idiosyncratic choice of data. revision: yes

  2. Referee: [§4] §4 (downstream evaluation and ablations): the quantitative results (up to 14% improvement, advantage on higher-order mutations) are presented without error bars, number of random seeds, or statistical significance tests. Because the central claim rests on these gains being robust and attributable to mask placement, the absence of variance estimates leaves open the possibility that the differences are within run-to-run variability of the fitness predictors.

    Authors: We concur that variance estimates and significance testing are required to substantiate the reported improvements. The revised Section 4 now reports all metrics as means over five independent random seeds with standard-deviation error bars. We have also added paired t-test p-values comparing Bucket Masking against the random-masking baseline; the improvements remain statistically significant (p < 0.05) on every task. The controlled ablations isolating mask placement from span length are likewise reported with these statistics, reinforcing that the gains arise from the positional bias rather than run-to-run variability. revision: yes

Circularity Check

0 steps flagged

No circularity detected; empirical improvements measured on independent downstream tasks

full rationale

The paper defines Bucket Masking via 3D structural proximity and reports performance gains on four held-out downstream fitness prediction tasks. These gains are evaluated externally rather than being fitted to or defined by the masking distribution itself. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description. The central claim rests on controlled ablations showing gains from mask placement, which are falsifiable against standard random masking baselines. This constitutes a self-contained empirical result with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work rests on the standard masked language modeling objective plus the domain assumption that 3D contacts encode functional couplings; no new fitted parameters or invented physical entities are introduced beyond the masking algorithm itself.

axioms (1)
  • domain assumption Masked language modeling remains a suitable pretraining objective when the masking distribution is altered to reflect structural proximity.
    The paper modifies only the masking distribution while keeping the overall MLM loss and model architecture unchanged.
invented entities (1)
  • Bucket Masking no independent evidence
    purpose: To define groups of residues for masking based on 3D spatial proximity.
    A new algorithmic procedure introduced to replace uniform random masking.

pith-pipeline@v0.9.0 · 5702 in / 1186 out tokens · 44962 ms · 2026-05-20T19:22:33.467433+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we construct a residue contact graph from the wild-type (WT) protein structure and partition contacts into distance-based “buckets” according to spatial proximity... τ=7, following empirical evaluations for fold discrimination

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 1 internal anchor

  1. [1]

    A., Mathur, S., Salabert, D., Ballot, J., R´egulo, C., Metcalfe, T

    Zeming Lin et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, March 2023. ISSN 1095-9203. doi: 10.1126/science. ade2574

  2. [2]

    BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapoli...

  3. [3]

    ERNIE: Enhanced language representation with informative entities

    Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: Enhanced language representation with informative entities. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1441–1451, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1139

  4. [4]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu et al. Roberta: A robustly optimized bert pretraining approach, 2019. URL https://arxiv.org/abs/1907.11692

  5. [5]

    Weld, Luke Zettlemoyer, and Omer Levy

    Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans.Transactions of the Association for Computational Linguistics, 8:64–77, December 2020. ISSN 2307-387X. doi: 10.1162/tacl_a_00300

  6. [6]

    In: Proceed- ings of the 2020 Conference on Empirical Methods in Natural Language Process- ing (EMNLP)

    Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. LUKE: Deep contextualized entity representations with entity-aware self-attention. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/...

  7. [7]

    Ahmed Elnaggar et al. Prottrans: Toward understanding the language of life through self- supervised learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10): 7112–7127, October 2022. ISSN 1939-3539. doi: 10.1109/tpami.2021.3095381

  8. [8]

    Alexandra Shulman-Peleg, Ruth Nussinov, and Haim J. Wolfson. Recognition of functional sites in protein structures.Journal of Molecular Biology, 339(3):607–633, June 2004. ISSN 0022-2836. doi: 10.1016/j.jmb.2004.04.012

  9. [9]

    Sequence co-evolution gives 3D contacts and structures of protein complexes.eLife, 3, September 2014

    Thomas A Hopf et al. Sequence co-evolution gives 3D contacts and structures of protein complexes.eLife, 3, September 2014. ISSN 2050-084X. doi: 10.7554/elife.03430

  10. [10]

    McDonald, Craig Gambogi, Andrew L

    Jian Wang, Abha Jain, Leanna R. McDonald, Craig Gambogi, Andrew L. Lee, and Niko- lay V . Dokholyan. Mapping allosteric communications within individual proteins.Nature Communications, 11(1), July 2020. ISSN 2041-1723. doi: 10.1038/s41467-020-17618-2

  11. [11]

    Yaliraki

    Nan Wu, Léonie Strömich, and Sophia N. Yaliraki. Prediction of allosteric sites and signaling: Insights from benchmarking datasets.Patterns, 3(1):100408, January 2022. ISSN 2666-3899. doi: 10.1016/j.patter.2021.100408

  12. [12]

    EntityBERT: Entity-centric masking strategy for model pretraining for the clinical domain

    Chen Lin, Timothy Miller, Dmitriy Dligach, Steven Bethard, and Guergana Savova. EntityBERT: Entity-centric masking strategy for model pretraining for the clinical domain. InProceedings of the 20th Workshop on Biomedical Language Processing, pages 191–201, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.bionlp-1.21

  13. [13]

    Sundaram, Wolfgang Nejdl, and Niloy Ganguly

    Soumyadeep Roy, Jonas Wallat, Sowmya S. Sundaram, Wolfgang Nejdl, and Niloy Ganguly. GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. IOS Press, September 2023. ISBN 9781643684376. doi: 10.3233/faia230492

  14. [14]

    Pre-training a BERT with curriculum learning by increasing block-size of input text

    Koichi Nagatsuka, Clifford Broni-Bediako, and Masayasu Atsumi. Pre-training a BERT with curriculum learning by increasing block-size of input text. InProceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Methods and Applications, RANLP 2021, page 989–996. INCOMA Ltd. Shoumen, BULGARIA,...

  15. [15]

    Efficient pre- training of masked language model via concept-based curriculum masking

    Mingyu Lee, Jun-Hyung Park, Junho Kim, Kang-Min Kim, and SangKeun Lee. Efficient pre- training of masked language model via concept-based curriculum masking. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7417–7427, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. d...

  16. [16]

    Learning better masking for better language model pre-training

    Dongjie Yang, Zhuosheng Zhang, and Hai Zhao. Learning better masking for better language model pre-training. InProceedings of the 61st Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), pages 7255–7267, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.400

  17. [17]

    IOS Press, October 2024

    Soumyadeep Roy, Shamik Sural, and Niloy Ganguly.Unlocking Efficiency: Adaptive Masking for Gene Transformer Models. IOS Press, October 2024. ISBN 9781643685489. doi: 10.3233/ faia240864

  18. [18]

    Fuson-plm: a fusion oncoprotein-specific language model via adjusted rate mask- ing.Nature Communications, 16(1), February 2025

    Sophia Vincoff, Shrey Goel, Kseniia Kholina, Rishab Pulugurta, Pranay Vure, and Pranam Chatterjee. Fuson-plm: a fusion oncoprotein-specific language model via adjusted rate mask- ing.Nature Communications, 16(1), February 2025. ISSN 2041-1723. doi: 10.1038/ s41467-025-56745-6

  19. [19]

    A ConvNet for the 2020s

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988, 2022. doi: 10.1109/CVPR52688.2022. 01553

  20. [20]

    A ConvNet for the 2020s

    Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14648–14658, 2022. doi: 10.1109/ CVPR52688.2022.01426

  21. [21]

    Training compute-optimal protein language models, 2024

    Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, and Le Song. Training compute-optimal protein language models, 2024. URLhttps://arxiv.org/abs/2411.02142

  22. [22]

    Focused learning by antibody language models using preferential masking of non-templated regions.Patterns, 6(6):101239, June 2025

    Karenna Ng and Bryan Briney. Focused learning by antibody language models using preferential masking of non-templated regions.Patterns, 6(6):101239, June 2025. ISSN 2666-3899. doi: 10.1016/j.patter.2025.101239

  23. [23]

    Understanding and enhancing mask-based pretraining towards universal representations

    Mingze Dong, Leda Wang, and Yuval Kluger. Understanding and enhancing mask-based pretraining towards universal representations. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  24. [24]

    MSA transformer

    Roshan M Rao, Jason Liu, Robert Verkuil, Joshua Meier, John Canny, Pieter Abbeel, Tom Sercu, and Alexander Rives. MSA transformer. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 8844–8856. PMLR, 18–24 Jul 2021

  25. [25]

    Saprot: Protein language modeling with structure-aware vocabulary

    Jin Su, Chenchen Han, Yuyang Zhou, Junjie Shan, Xibin Zhou, and Fajie Yuan. Saprot: Protein language modeling with structure-aware vocabulary. InInternational Conference on Learning Representations, volume 2024, pages 6987–7009, 2024

  26. [26]

    R. Rao, N. Bhattacharya, N. Thomas, Y . Duan, X. Chen, J. Canny, P. Abbeel, and Y . S. Song. Evaluating protein transfer learning with tape. InAdvances in Neural Information Processing Systems, volume 32, pages 9689–9701, Dec 2019. PMID: 33390682; PMCID: PMC7774645

  27. [27]

    Sparse autoencoders for low- n protein function prediction and design, 2025

    Darin Tsui, Kunal Talreja, and Amirali Aghazadeh. Sparse autoencoders for low- n protein function prediction and design, 2025. URLhttps://arxiv.org/abs/2508.18567

  28. [28]

    Golf: A generative ai framework for pathogenicity prediction of myocilin olf variants

    Thomas Walton, Darin Tsui, Lauren Fogel, Dustin Huard, Rafael Chagas, Raquel Lieberman, and Amirali Aghazadeh. Golf: A generative ai framework for pathogenicity prediction of myocilin olf variants. InProceedings of the 20th Machine Learning in Computational Biology meeting, volume 311 ofProceedings of Machine Learning Research, pages 148–161. PMLR, 10–11 ...

  29. [29]

    Strait and T.G

    B.J. Strait and T.G. Dewey. The shannon information entropy of protein sequences.Biophysical Journal, 71(1):148–155, July 1996. ISSN 0006-3495. doi: 10.1016/s0006-3495(96)79210-x

  30. [30]

    The language of proteins: NLP, machine learning and protein sequences.Computational and Structural Biotechnology Journal, 19:1750–1758,

    Dan Ofer, Nadav Brandes, and Michal Linial. The language of proteins: NLP, machine learning and protein sequences.Computational and Structural Biotechnology Journal, 19:1750–1758,

  31. [31]

    doi: 10.1016/j.csbj.2021.03.022

    ISSN 2001-0370. doi: 10.1016/j.csbj.2021.03.022

  32. [32]

    ProteinBERT: a universal deep-learning model of protein sequence and function.Bioinformatics, 38(8): 2102–2110, February 2022

    Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial. ProteinBERT: a universal deep-learning model of protein sequence and function.Bioinformatics, 38(8): 2102–2110, February 2022. ISSN 1367-4811. doi: 10.1093/bioinformatics/btac020

  33. [33]

    Chothia and A.M

    C. Chothia and A.M. Lesk. The relation between the divergence of sequence and structure in proteins.The EMBO Journal, 5(4):823–826, April 1986. ISSN 0261-4189. doi: 10.1002/j. 1460-2075.1986.tb04288.x

  34. [34]

    Ardell, and Arne Elofsson

    Kristoffer Illergård, David H. Ardell, and Arne Elofsson. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores.Proteins: Structure, Function, and Bioinformatics, 77(3):499–508, June 2009. ISSN 1097-0134. doi: 10.1002/prot.22458

  35. [35]

    Proteingym: Large-scale benchmarks for protein design and fitness prediction

    Pascal Notin et al. Proteingym: Large-scale benchmarks for protein design and fitness prediction. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2023

  36. [36]

    H. M. Berman. The protein data bank.Nucleic Acids Research, 28(1):235–242, January 2000. ISSN 1362-4962. doi: 10.1093/nar/28.1.235

  37. [37]

    UniProt: the uni- versal protein knowledgebase in 2025

    Alex Bateman et al. Uniprot: the universal protein knowledgebase in 2025.Nucleic Acids Research, 53(D1):D609–D617, November 2024. ISSN 1362-4962. doi: 10.1093/nar/gkae1010

  38. [38]

    Effective inter-residue contact definitions for accurate protein fold recognition.BMC Bioinformatics, 13(1), November 2012

    Chao Yuan, Hao Chen, and Daisuke Kihara. Effective inter-residue contact definitions for accurate protein fold recognition.BMC Bioinformatics, 13(1), November 2012. ISSN 1471-

  39. [39]

    doi: 10.1186/1471-2105-13-292

  40. [40]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  41. [41]

    Contrastive fitness learning: Reprogramming protein language models for low-n learning of protein fitness landscape

    Junming Zhao, Chao Zhang, and Yunan Luo. Contrastive fitness learning: Reprogramming protein language models for low-n learning of protein fitness landscape. InInternational Conference on Research in Computational Molecular Biology, pages 470–474. Springer, 2024

  42. [42]

    Fine-tuning protein language models with deep mutational scanning improves variant effect prediction.arXiv preprint arXiv:2405.06729, 2024

    Aleix Lafita, Ferran Gonzalez, Mahmoud Hossam, Paul Smyth, Jacob Deasy, Ari Allyn-Feuer, Daniel Seaton, and Stephen Young. Fine-tuning protein language models with deep mutational scanning improves variant effect prediction.arXiv preprint arXiv:2405.06729, 2024

  43. [43]

    Likelihood-based fine-tuning of protein language models for few-shot fitness prediction and design.bioRxiv, pages 2024–05, 2024

    Alex Hawkins-Hooker, Shikha Surana, Jack Simons, Jakub Kmec, Oliver Bent, and Paul Duckworth. Likelihood-based fine-tuning of protein language models for few-shot fitness prediction and design.bioRxiv, pages 2024–05, 2024

  44. [44]

    An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 33:3776–3786, 2025

    Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning.IEEE Transactions on Audio, Speech and Language Processing, 33:3776–3786, 2025. ISSN 2998-

  45. [45]

    doi: 10.1109/taslpro.2025.3606231

  46. [46]

    SpecMER: Fast protein generation with k-mer guided speculative decoding

    Thomas Walton, Darin Tsui, Aryan Musharaf, and Amirali Aghazadeh. SpecMER: Fast protein generation with k-mer guided speculative decoding. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id= 2sG4ebgqBd

  47. [47]

    Poelwijk, Michael Socolich, and Rama Ranganathan

    Frank J. Poelwijk, Michael Socolich, and Rama Ranganathan. Learning the pattern of epistasis linking genotype and phenotype in a protein.Nature Communications, 10(1), September 2019. ISSN 2041-1723. doi: 10.1038/s41467-019-12130-8. 12

  48. [48]

    Genotype to phenotype mapping and the fitness landscape of the e

    Jakub Otwinowski and Ilya Nemenman. Genotype to phenotype mapping and the fitness landscape of the e. coli lac promoter.PLoS ONE, 8(5):e61570, May 2013. ISSN 1932-6203. doi: 10.1371/journal.pone.0061570

  49. [49]

    Adams, Justin B

    Rhys M. Adams, Justin B. Kinney, Aleksandra M. Walczak, and Thierry Mora. Epistasis in a fitness landscape defined by antibody-antigen binding free energy.Cell Systems, 8(1):86–93.e3, January 2019. ISSN 2405-4712. doi: 10.1016/j.cels.2018.12.004

  50. [50]

    Faure, Aina Martí-Aranda, Cristina Hidalgo-Carcedo, Antoni Beltran, Jörn M

    Andre J. Faure, Aina Martí-Aranda, Cristina Hidalgo-Carcedo, Antoni Beltran, Jörn M. Schmiedel, and Ben Lehner. The genetic architecture of protein stability.Nature, 634(8035): 995–1003, September 2024. ISSN 1476-4687. doi: 10.1038/s41586-024-07966-0

  51. [51]

    Ensemble epistasis: ther- modynamic origins of nonadditivity between mutations.Genetics, 219(1), July 2021

    Anneliese J Morrison, Daria R Wonderlick, and Michael J Harms. Ensemble epistasis: ther- modynamic origins of nonadditivity between mutations.Genetics, 219(1), July 2021. ISSN 1943-2631. doi: 10.1093/genetics/iyab105

  52. [52]

    On recovering higher-order interactions from protein language models, 2024

    Darin Tsui and Amirali Aghazadeh. On recovering higher-order interactions from protein language models, 2024

  53. [53]

    Hatzakis, and Wouter Boomsma

    Richard Michael, Jacob Kæstel-Hansen, Peter Mørch Groth, Simon Bartels, Jesper Salomon, Pengfei Tian, Nikos S. Hatzakis, and Wouter Boomsma. Assessing the performance of protein regression models, June 2023

  54. [54]

    Learning protein fitness landscapes with deep mutational scanning data from multiple sources.Cell Systems, 14(8):706–721.e5, August 2023

    Lin Chen et al. Learning protein fitness landscapes with deep mutational scanning data from multiple sources.Cell Systems, 14(8):706–721.e5, August 2023. ISSN 2405-4712. doi: 10.1016/j.cels.2023.07.003

  55. [55]

    Fahlberg, Pete Heinzelman, Philip A

    Sam Gelman, Sarah A. Fahlberg, Pete Heinzelman, Philip A. Romero, and Anthony Gitter. Neural networks to learn protein sequence–function relationships from deep mutational scanning data.Proceedings of the National Academy of Sciences, 118(48), November 2021. ISSN 1091-

  56. [56]

    doi: 10.1073/pnas.2104878118

  57. [57]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735, 2020. doi: 10.1109/CVPR42600. 2020.00975

  58. [58]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. InProceedings of the 37th International Conference on Machine Learning (ICML), pages 1597–1607, 2020

  59. [59]

    Generative AA V capsid diversifica- tion by latent interpolation, April 2021

    Sam Sinai, Nina Jain, George M Church, and Eric D Kelsic. Generative AA V capsid diversifica- tion by latent interpolation, April 2021

  60. [60]

    Comprehensive fitness maps of hsp90 show widespread environmental dependence.eLife, 9, March 2020

    Julia M Flynn, Ammeret Rossouw, Pamela Cote-Hammarlof, Inês Fragata, David Mavor, Carl Hollins, Claudia Bank, and Daniel NA Bolon. Comprehensive fitness maps of hsp90 show widespread environmental dependence.eLife, 9, March 2020. ISSN 2050-084X. doi: 10.7554/elife.53810

  61. [61]

    Araya, Douglas M

    Carlos L. Araya, Douglas M. Fowler, Wentao Chen, Ike Muniez, Jeffery W. Kelly, and Stanley Fields. A fundamental protein property, thermodynamic stability, revealed solely from large- scale measurements of protein function.Proceedings of the National Academy of Sciences, 109 (42):16858–16863, October 2012. ISSN 1091-6490. doi: 10.1073/pnas.1209751109

  62. [62]

    Anders Olson, Nicholas C

    C. Anders Olson, Nicholas C. Wu, and Ren Sun. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain.Current Biology, 24(22):2643–2651, November 2014. ISSN 0960-9822. doi: 10.1016/j.cub.2014.09.072

  63. [63]

    Staller, Alex S

    Max V . Staller, Alex S. Holehouse, Devjanee Swain-Lenz, Rahul K. Das, Rohit V . Pappu, and Barak A. Cohen. A high-throughput mutational scan of an intrinsically disordered acidic transcriptional activation domain.Cell Systems, 6(4):444–455.e6, April 2018. ISSN 2405-4712. doi: 10.1016/j.cels.2018.01.015. 13

  64. [64]

    Sarkisyan et al

    Karen S. Sarkisyan et al. Local fitness landscape of the green fluorescent protein.Nature, 533 (7603):397–401, May 2016. ISSN 1476-4687. doi: 10.1038/nature17995

  65. [65]

    Heterogeneity of the gfp fitness landscape and data-driven protein design.eLife, 11, May 2022

    Louisa Gonzalez Somermeyer et al. Heterogeneity of the gfp fitness landscape and data-driven protein design.eLife, 11, May 2022. ISSN 2050-084X. doi: 10.7554/elife.75842

  66. [66]

    Pokusaeva et al

    Victoria O. Pokusaeva et al. An experimental assay of the interactions of amino acids from orthologous sequences shaping a complex fitness landscape.PLOS Genetics, 15(4):e1008079, April 2019. ISSN 1553-7404. doi: 10.1371/journal.pgen.1008079

  67. [67]

    Faure, Júlia Domingo, Jörn M

    Andre J. Faure, Júlia Domingo, Jörn M. Schmiedel, Cristina Hidalgo-Carcedo, Guillaume Diss, and Ben Lehner. Mapping the energetic and allosteric landscapes of protein binding domains. Nature, 604(7904):175–183, April 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-04586-4

  68. [68]

    Faure, and Ben Lehner

    Chenchun Weng, Andre J. Faure, and Ben Lehner. The energetic and allosteric landscape for kras inhibition. December 2022. doi: 10.1101/2022.12.06.519122

  69. [69]

    Suiter et al

    Chase C. Suiter et al. Massively parallel variant characterization identifies nudt15 alleles associated with thiopurine toxicity.Proceedings of the National Academy of Sciences, 117(10): 5394–5401, February 2020. ISSN 1091-6490. doi: 10.1073/pnas.1915680117

  70. [70]

    Protein design using structure-based residue preferences, November 2022

    David Ding et al. Protein design using structure-based residue preferences, November 2022

  71. [71]

    Weinstein, Niall M

    Kotaro Tsuboyama, Justas Dauparas, Jonathan Chen, Elodie Laine, Yasser Mohseni Behbahani, Jonathan J. Weinstein, Niall M. Mangan, Sergey Ovchinnikov, and Gabriel J. Rocklin. Mega- scale experimental analysis of protein folding stability in biology and design.Nature, 620 (7973):434–444, July 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06328-6. 14 Broader...