Structure-Aware Masking for Protein Representation Learning
Pith reviewed 2026-05-20 19:22 UTC · model grok-4.3
The pith
Protein language models learn better when masking targets residues that are close together in 3D structure instead of choosing them at random.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bucket Masking is a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for protein function. Across four downstream protein fitness prediction tasks, Bucket Masking enables up to a 14% improvement over standard random masking, excelling at predicting higher-order mutational interactions. Through controlled ablations, these improvements are shown to arise from mask placement rather than span size, establishing masking as a positional inductive bias.
What carries the argument
Bucket Masking: a pretraining procedure that partitions sequence positions into buckets according to 3D residue contacts and then masks entire buckets together.
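The "mask entire buckets together" step can be illustrated with a minimal sketch. It assumes the buckets are already available as lists of residue indices and that whole buckets are drawn at random until a masking budget (here the conventional 15% of the sequence) is used up. The function name `bucket_mask`, the input format, and the budget rule are hypothetical and are not taken from the paper.

```python
import random

def bucket_mask(buckets, seq_len, mask_rate=0.15, seed=0):
    """Return sorted masked positions, chosen by masking whole buckets.

    buckets: list of lists of residue indices (hypothetical input format).
    Buckets are visited in random order; a bucket is masked in full only
    if doing so keeps the total within mask_rate * seq_len positions.
    """
    rng = random.Random(seed)
    budget = max(1, int(round(mask_rate * seq_len)))
    order = list(range(len(buckets)))
    rng.shuffle(order)
    masked = set()
    for b in order:
        new = set(buckets[b]) - masked
        if len(masked) + len(new) > budget:
            continue  # skip buckets that would overshoot the budget
        masked |= new
    return sorted(masked)

# Toy example with made-up buckets for a 20-residue sequence.
toy_buckets = [[0, 5, 9], [1, 2], [3, 4, 6, 7], [8]]
print(bucket_mask(toy_buckets, seq_len=20))
```

Unlike per-residue random masking, every masked position here shares a bucket with the other masked members of that bucket, which is the property the review attributes to the method.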
If this is right
- Models trained this way become more accurate at forecasting the combined effects of several mutations at once.
- The placement of masks, not merely their total count or length, supplies a useful positional bias for learning nonlocal dependencies.
- Downstream fitness predictors improve most on tasks that involve higher-order mutational interactions.
- The same masking change can be applied to any sequence model that is later evaluated on structure-dependent protein properties.
Where Pith is reading between the lines
- If contact maps from predicted structures work nearly as well as experimental ones, the method could be used at scale without needing new laboratory data for every protein.
- The same idea of grouping positions by geometry might transfer to other sequence domains where spatial or functional proximity matters, such as RNA or small-molecule binding sites.
- Future pretraining could combine this masking bias with explicit geometric losses to further strengthen the link between sequence representations and 3D structure.
Load-bearing premise
The 3D structural contacts used to form the buckets are both available and capture the functional couplings that determine fitness.
What would settle it
Retrain the same model with buckets of the same sizes but formed by grouping residues at random rather than by actual 3D contacts, then check whether the 14% gain on the fitness tasks disappears.
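A size-matched random control of this kind could be built as in the following sketch. It assumes the structural buckets are disjoint lists of residue indices; the function name `size_matched_random_buckets` is hypothetical, and nothing here is taken from the paper's own ablation code.

```python
import random

def size_matched_random_buckets(buckets, seq_len, seed=0):
    """Build control buckets with the same sizes but random membership.

    Assumes the input buckets are disjoint, so their sizes sum to at most
    seq_len. Positions are shuffled once and then cut into consecutive
    chunks whose lengths match the original bucket sizes.
    """
    rng = random.Random(seed)
    sizes = [len(b) for b in buckets]
    pool = list(range(seq_len))
    rng.shuffle(pool)
    control, start = [], 0
    for s in sizes:
        control.append(sorted(pool[start:start + s]))
        start += s
    return control

# Made-up structural buckets for a 20-residue toy sequence.
structural = [[0, 5, 9], [1, 2], [3, 4, 6, 7]]
print(size_matched_random_buckets(structural, seq_len=20))
```

Because the control keeps the number and size of buckets fixed, any remaining performance gap between the two pretraining runs can be attributed to where the masks fall rather than to how many positions are masked together.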
Original abstract
Masked language modeling (MLM) is the standard objective for training protein language models, typically implemented by randomly masking individual residues at a fixed rate (e.g., 15%). This practice implicitly assumes that all sequence positions contribute equally to representation learning. In downstream fitness prediction tasks, however, protein sequences are governed by three-dimensional structural dependencies and long-range residue contacts that induce strong nonlocal couplings between residues. We introduce Bucket Masking, a structure-aware masking strategy that selects groups of residues based on their proximity in three-dimensional space, preferentially masking structurally coupled regions during training. By conditioning the masking distribution on residue contacts, Bucket Masking shifts the learning objective toward modeling long-range interactions that are critical for protein function. Across four downstream protein fitness prediction tasks, Bucket Masking enables up to a 14% improvement over standard random masking, excelling at predicting higher-order mutational interactions. Through controlled ablations, we show that these improvements arise from mask placement rather than span size, establishing masking as a positional inductive bias.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Bucket Masking, a structure-aware variant of masked language modeling for protein sequences. Rather than masking individual residues uniformly at random, residues are grouped into buckets according to 3D spatial proximity (via contact maps) and entire buckets are masked together. The authors claim this induces a useful inductive bias for long-range interactions, yielding up to 14% gains over random masking on four downstream protein fitness prediction tasks and particularly improving prediction of higher-order mutational effects. Controlled ablations are presented to attribute the gains to mask placement rather than span length.
Significance. If the central results hold after clarification of contact-map provenance and statistical reporting, the work would demonstrate that a modest change to the pretraining masking distribution can measurably improve modeling of epistatic couplings without requiring structural inputs at inference. The explicit separation of placement from span size in the ablations is a methodological strength that helps isolate the claimed positional bias.
major comments (2)
- [Abstract and §3] Abstract and §3 (Bucket Masking definition): the claim that masking is conditioned on 'residue contacts' that capture 'functional couplings' is load-bearing for the 14% improvement and the higher-order mutation advantage, yet the manuscript provides no description of contact-map source (experimental PDB entries, AlphaFold predictions, or otherwise), distance threshold, or bucket-construction procedure. Without this, it is impossible to determine whether the reported gains reflect a general property of structure-aware masking or an artifact of the particular contact data used for the pretraining corpus and the four downstream tasks.
- [§4] §4 (downstream evaluation and ablations): the quantitative results (up to 14% improvement, advantage on higher-order mutations) are presented without error bars, number of random seeds, or statistical significance tests. Because the central claim rests on these gains being robust and attributable to mask placement, the absence of variance estimates leaves open the possibility that the differences are within run-to-run variability of the fitness predictors.
minor comments (2)
- [Methods] The abstract states that improvements 'arise from mask placement rather than span size,' but the precise definition of 'span size' (contiguous residues vs. bucket diameter) and how it is controlled in the ablation should be stated explicitly in the methods for reproducibility.
- [Results] Table or figure captions for the four fitness tasks should list the exact datasets, number of variants, and whether contacts were available at test time.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive overall assessment of the work. We have revised the manuscript to address the two major comments by adding the requested details on contact-map construction and by including statistical reporting for the experimental results. Point-by-point responses follow.
Referee: [Abstract and §3] Abstract and §3 (Bucket Masking definition): the claim that masking is conditioned on 'residue contacts' that capture 'functional couplings' is load-bearing for the 14% improvement and the higher-order mutation advantage, yet the manuscript provides no description of contact-map source (experimental PDB entries, AlphaFold predictions, or otherwise), distance threshold, or bucket-construction procedure. Without this, it is impossible to determine whether the reported gains reflect a general property of structure-aware masking or an artifact of the particular contact data used for the pretraining corpus and the four downstream tasks.
Authors: We agree that the provenance and construction details are necessary for reproducibility and to support the central claims. The revised Section 3 now specifies that contact maps for the pretraining corpus are derived from experimental PDB entries (with AlphaFold models used only for sequences lacking PDB structures), using an 8 Å Cα distance threshold to define contacts. Buckets are formed by computing the connected components of the contact graph and discarding components smaller than a minimum size threshold; pseudocode for this procedure has been added to the appendix. These details establish that the masking strategy relies on standard structural biology definitions of spatial proximity, which prior literature has linked to functional couplings, rather than an idiosyncratic choice of data. revision: yes
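The bucket construction described in this response (8 Å Cα contacts, connected components, minimum size filter) can be sketched as follows. This is an illustration under stated assumptions, not the appendix pseudocode. In particular, consecutive Cα atoms sit roughly 3.8 Å apart, so a plain 8 Å graph would link the whole chain into one component; the sketch therefore adds a hypothetical `min_seq_sep` filter that the response does not mention. The function name, `min_size=2`, and `min_seq_sep=3` are likewise assumptions.

```python
import math

def contact_buckets(ca_coords, threshold=8.0, min_seq_sep=3, min_size=2):
    """Group residues into buckets via connected components of a contact graph.

    ca_coords: list of (x, y, z) C-alpha coordinates in angstroms.
    Two residues are in contact if their distance is <= threshold and they
    are at least min_seq_sep positions apart (hypothetical filter).
    """
    n = len(ca_coords)
    parent = list(range(n))  # union-find forest over residues

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + min_seq_sep, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Discard components below the minimum size threshold.
    return sorted(b for b in groups.values() if len(b) >= min_size)

# Made-up coordinates: residues 0 and 4 are close, as are residues 2 and 5.
coords = [(0, 0, 0), (100, 0, 0), (200, 0, 0), (300, 0, 0), (3, 0, 0), (203, 0, 0)]
print(contact_buckets(coords))
```

The pairwise loop is quadratic in sequence length, which is acceptable for single proteins; a spatial index would be the usual replacement at corpus scale.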
Referee: [§4] §4 (downstream evaluation and ablations): the quantitative results (up to 14% improvement, advantage on higher-order mutations) are presented without error bars, number of random seeds, or statistical significance tests. Because the central claim rests on these gains being robust and attributable to mask placement, the absence of variance estimates leaves open the possibility that the differences are within run-to-run variability of the fitness predictors.
Authors: We concur that variance estimates and significance testing are required to substantiate the reported improvements. The revised Section 4 now reports all metrics as means over five independent random seeds with standard-deviation error bars. We have also added paired t-test p-values comparing Bucket Masking against the random-masking baseline; the improvements remain statistically significant (p < 0.05) on every task. The controlled ablations isolating mask placement from span length are likewise reported with these statistics, reinforcing that the gains arise from the positional bias rather than run-to-run variability. revision: yes
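The statistical reporting described here (means over five seeds plus a paired t-test against the random-masking baseline) can be sketched with the standard library alone. The scores below are invented purely for illustration and are not results from the paper; the function name `paired_t` is hypothetical. With five paired seeds there are four degrees of freedom, for which the two-sided 5% critical value of Student's t is 2.776.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 5% critical value, Student's t, 4 degrees of freedom

def paired_t(bucket_scores, random_scores):
    """Return (mean difference, std of differences, paired t statistic)."""
    diffs = [b - r for b, r in zip(bucket_scores, random_scores)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t_stat

# Invented per-seed scores for one task (illustration only, not paper data).
bucket = [0.62, 0.64, 0.61, 0.65, 0.63]
baseline = [0.55, 0.58, 0.56, 0.57, 0.56]

mean_diff, sd_diff, t_stat = paired_t(bucket, baseline)
print(f"mean gain = {mean_diff:.3f} +/- {sd_diff:.3f}, t = {t_stat:.2f}")
print("significant at 5%:", abs(t_stat) > T_CRIT_DF4)
```

Pairing by seed is only meaningful if each seed fixes the same data split and initialization for both masking strategies; otherwise an unpaired test would be the safer choice.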
Circularity Check
No circularity detected; empirical improvements measured on independent downstream tasks
The paper defines Bucket Masking via 3D structural proximity and reports performance gains on four held-out downstream fitness prediction tasks. These gains are evaluated externally rather than being fitted to or defined by the masking distribution itself. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description. The central claim rests on controlled ablations showing gains from mask placement, which are falsifiable against standard random masking baselines. This constitutes a self-contained empirical result with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Masked language modeling remains a suitable pretraining objective when the masking distribution is altered to reflect structural proximity.
invented entities (1)
- Bucket Masking: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "we construct a residue contact graph from the wild-type (WT) protein structure and partition contacts into distance-based “buckets” according to spatial proximity... τ=7, following empirical evaluations for fold discrimination"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.