pith. sign in

arxiv: 1906.08230 · v1 · pith:RMW3QW7Nnew · submitted 2019-06-19 · 💻 cs.LG · q-bio.BM· stat.ML

Evaluating Protein Transfer Learning with TAPE

Pith reviewed 2026-05-25 20:12 UTC · model grok-4.3

classification 💻 cs.LG q-bio.BMstat.ML
keywords protein embeddingssemi-supervised learningtransfer learningbenchmarkself-supervised pretrainingprotein biologyrepresentation learning
0
0 comments X

The pith

Self-supervised pretraining more than doubles performance on protein tasks but still lags non-neural methods in several cases

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TAPE, a benchmark of five semi-supervised tasks spanning protein biology domains, each with curated train-validation-test splits designed to measure generalization relevant to real scenarios. It evaluates multiple representation learning approaches and shows that self-supervised pretraining improves nearly every model on every task, sometimes more than doubling results. Even after this gain, the learned features remain behind those from leading non-neural techniques on multiple tasks. The work therefore positions TAPE as a tool to steer machine learning effort toward biologically meaningful protein problems while exposing a clear performance gap that new architectures could address.

Core claim

TAPE supplies five biologically relevant semi-supervised tasks with fixed splits; benchmarking across models demonstrates that self-supervised pretraining helps almost all of them on all tasks, more than doubling performance in some cases, yet the resulting features still trail state-of-the-art non-neural techniques on several tasks.

What carries the argument

TAPE benchmark: five semi-supervised protein tasks with curated splits that enforce biologically relevant generalization

If this is right

  • Self-supervised pretraining will raise performance on the TAPE protein tasks for nearly all models.
  • The persistent gap versus non-neural methods indicates a need for new architectures that better extract signal from biological sequences.
  • Standardized TAPE splits will let future work measure whether innovations actually improve real-world protein generalization.
  • Releasing all data and code will allow direct replication and extension of the reported comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining self-supervised embeddings with non-neural features could close part of the observed gap without new model design.
  • The same pretraining-plus-benchmark pattern could be applied to other sequence domains such as RNA or small molecules.
  • If the gap persists across future architectures, it may point to fundamental limits in what purely sequence-based models can capture from protein data.

Load-bearing premise

The curated training, validation, and test splits ensure that each task tests biologically relevant generalization that transfers to real-life scenarios.

What would settle it

A self-supervised model that matches or exceeds the best non-neural baseline on every TAPE task, or a controlled experiment showing pretraining yields no improvement on the same splits, would falsify the reported gap and benefit claims.

Figures

Figures reproduced from arXiv: 1906.08230 by John Canny, Neil Thomas, Nicholas Bhattacharya, Pieter Abbeel, Roshan Rao, Xi Chen, Yan Duan, Yun S. Song.

Figure 1
Figure 1. Figure 1: Structure and Annotation Tasks on protein KgdM Porin (pdbid: 4FQE). (a) Viewing this [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Protein Engineering Tasks. In both tasks, a parent protein [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Distribution of training, test, and pretrained Transformer predictions on the dark and bright [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Predicted contacts for chain 1A of a Bacterioferritin comigratory protein (pdbid: 3GKN). [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
read the original abstract

Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Tasks Assessing Protein Embeddings (TAPE) benchmark consisting of five semi-supervised learning tasks across protein biology domains. It curates task-specific train/validation/test splits to promote biologically relevant generalization, benchmarks a range of self-supervised pretraining approaches and canonical sequence models, and reports that pretraining improves performance on nearly all tasks (more than doubling it in some cases) yet still trails state-of-the-art non-neural feature extraction methods on several tasks.

Significance. If the central empirical patterns hold under properly controlled generalization, the work supplies a much-needed standardized benchmark and reproducible codebase (all data and code released at https://github.com/songlab-cal/tape) that can focus community effort on scientifically grounded protein modeling problems and quantify the remaining gap between learned embeddings and established non-neural techniques.

major comments (1)
  1. [Abstract / task curation] Abstract (task curation paragraph) and the corresponding methods section on split construction: the statement that splits were curated “to ensure that each task tests biologically relevant generalization” is not accompanied by quantitative details on the homology-reduction procedure (sequence-identity threshold, structural clustering method, or evolutionary distance cutoff). Because every reported performance number and the comparison to non-neural baselines rests on the assumption of genuine out-of-distribution generalization, the absence of these metrics leaves open the possibility of residual homology leakage that would invalidate the headline claims.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from an explicit forward reference to the exact section or table that lists the five tasks and their data sources.
  2. [Figures] Figure captions should state the number of independent runs or seeds used to generate error bars so that readers can assess statistical reliability of the reported improvements.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on the TAPE benchmark. We agree that greater transparency on the quantitative aspects of split construction is important for validating the out-of-distribution generalization claims. We will revise the manuscript to address this.

read point-by-point responses
  1. Referee: [Abstract / task curation] Abstract (task curation paragraph) and the corresponding methods section on split construction: the statement that splits were curated “to ensure that each task tests biologically relevant generalization” is not accompanied by quantitative details on the homology-reduction procedure (sequence-identity threshold, structural clustering method, or evolutionary distance cutoff). Because every reported performance number and the comparison to non-neural baselines rests on the assumption of genuine out-of-distribution generalization, the absence of these metrics leaves open the possibility of residual homology leakage that would invalidate the headline claims.

    Authors: We appreciate the referee drawing attention to this. The original manuscript describes the split construction for each task in the Methods section (e.g., remote homology uses a 40% sequence-identity threshold with structural clustering via SCOPe; fluorescence and stability use evolutionary distance cutoffs derived from phylogenetic trees; secondary structure and contact prediction use standard train/test splits with additional homology filtering). However, we acknowledge that these details were not presented with sufficient quantitative precision or uniformity. In the revised version we will add a dedicated subsection (or expanded table) that explicitly lists, for every task: (i) the sequence-identity threshold applied, (ii) the clustering algorithm or database used, (iii) any evolutionary-distance or structural criteria, and (iv) the resulting train/validation/test sizes after filtering. This will make the homology-reduction procedure fully reproducible and directly address the concern about possible leakage. revision: yes

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements

full rationale

The paper is an empirical benchmark study that reports performance numbers obtained by training and evaluating models on externally defined tasks and splits. No result is obtained by fitting a parameter to a subset of the data and then relabeling that fit as a prediction, nor does any central claim reduce to a self-citation chain or to a quantity defined in terms of itself. The curation of splits is presented as a methodological decision whose validity is left to external scrutiny; it is not derived from any equation or prior result internal to the paper. All comparisons to non-neural baselines are likewise direct measurements against independent reference methods.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the domain assumption that the chosen tasks and splits are faithful proxies for real biological generalization; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The curated training, validation, and test splits test biologically relevant generalization that transfers to real-life scenarios.
    Stated explicitly in the abstract as the design goal of the benchmark splits.

pith-pipeline@v0.9.0 · 5786 in / 1235 out tokens · 37470 ms · 2026-05-25T20:12:01.496435+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

    cs.LG 2026-05 unverdicted novelty 7.0

    Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

  2. ProteinJEPA: Latent prediction complements protein language models

    cs.LG 2026-05 unverdicted novelty 7.0

    Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 2 Pith papers · 3 internal anchors

  1. [1]

    UniProt: a worldwide hub of protein knowledge

    The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research, 47(D1):D506–D515, 2018

  2. [2]

    The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain

    C B Anfinsen, E Haber, M Sela, F H White, and Jr. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proceedings of the National Academy of Sciences of the United States of America, 47(9):1309–14, 1961

  3. [3]

    Protein Structure Relationships Revealed by Mutational Analysis

    C Yanofsky, V Horn, and D Thorpe. Protein Structure Relationships Revealed by Mutational Analysis. Science (New York, N.Y.), 146(3651):1593–4, 1964

  4. [4]

    Coordinated amino acid changes in homologous protein families

    D Altschuh, T Vernet, P Berti, D Moras, and K Nagai. Coordinated amino acid changes in homologous protein families. Protein engineering, 2(3):193–9, 1988

  5. [5]

    Deep Contextualized Word Representations

    Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Lo...

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2018

  7. [7]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1:8, 2019

  8. [8]

    Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing

    Timothy A Whitehead, Aaron Chevalier, Yifan Song, Cyrille Dreyfus, Sarel J Fleishman, Cecilia De Mattos, Chris A Myers, Hetunandan Kamisetty, Patrick Blair, Ian A Wilson, et al. Optimization of affinity, specificity and function of designed influenza inhibitors using deep sequencing. Nature biotechnology, 30(6):543, 2012

  9. [9]

    Modular protein engineering in emerging cancer therapies

    Esther Vazquez, Neus Ferrer-Miralles, Ramon Mangues, Jose L Corchero, Jr Schwartz, An- tonio Villaverde, et al. Modular protein engineering in emerging cancer therapies. Current pharmaceutical design, 15(8):893–916, 2009

  10. [10]

    Sabir, Michael J

    Nelson Perdigão, Julian Heinrich, Christian Stolte, Kenneth S. Sabir, Michael J. Buckley, Bruce Tabor, Beth Signal, Brian S. Gloss, Christopher J. Hammang, Burkhard Rost, Andrea Schafferhans, and Seán I. O’Donoghue. Unexpected features of the dark proteome. Proceedings of the National Academy of Sciences, 112(52):15898–15903, 2015

  11. [11]

    Learning protein sequence embeddings using information from structure

    Tristan Bepler and Bonnie Berger. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations, 2019

  12. [12]

    Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M

    Ethan C. Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M. Church. Unified rational protein engineering with sequence-only deep representation learning. bioRxiv, page 589333, 2019

  13. [13]

    Gapped blast and psi-blast: a new generation of protein database search programs

    Stephen F Altschul, Thomas L Madden, Alejandro A Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J Lipman. Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids research, 25(17):3389–3402, 1997

  14. [14]

    HHblits: lightning- fast iterative protein sequence searching by HMM-HMM alignment

    Michael Remmert, Andreas Biegert, Andreas Hauser, and Johannes Söding. HHblits: lightning- fast iterative protein sequence searching by HMM-HMM alignment. Nature Methods, 9(2):173– 175, 2012

  15. [15]

    Soding, A

    J. Soding, A. Biegert, and A. N. Lupas. The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research, 33(Web Server):W244–W248, 2005

  16. [16]

    Sean R. Eddy. Profile hidden markov models.Bioinformatics (Oxford, England), 14(9):755–763, 1998

  17. [17]

    Recommendations on nomenclature and symbolism for amino acids and peptides

    IUPAC-IUB. Recommendations on nomenclature and symbolism for amino acids and peptides. Pure Appl. Chem, 56:595–623, 1984. 10

  18. [18]

    Proteins: structures and molecular properties

    Thomas E Creighton. Proteins: structures and molecular properties. Macmillan, 1993

  19. [19]

    Twilight zone of protein sequence alignments

    Burkhard Rost. Twilight zone of protein sequence alignments. Protein Engineering, Design and Selection, 12(2):85–94, 1999

  20. [20]

    Steven E Brenner, Cyrus Chothia, and Tim JP Hubbard. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.Proceedings of the National Academy of Sciences, 95(11):6073–6078, 1998

  21. [21]

    Altschul, Warren Gish, Webb Miller, Eugene W

    Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990

  22. [22]

    Biological sequence analysis: probabilistic models of proteins and nucleic acids

    Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998

  23. [23]

    Identification of common molecular subsequences

    Temple F Smith, Michael S Waterman, et al. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195–197, 1981

  24. [24]

    The swiss-model workspace: a web-based environment for protein structure homology modelling

    Konstantin Arnold, Lorenza Bordoli, Jürgen Kopp, and Torsten Schwede. The swiss-model workspace: a web-based environment for protein structure homology modelling. Bioinformatics, 22(2):195–201, 2006

  25. [25]

    Semi-supervised learning

    Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning. IEEE Transactions on Neural Networks, 20(3):542–542, 2009

  26. [26]

    Critical assessment of methods of protein structure prediction (CASP)-Round XII

    John Moult, Krzysztof Fidelis, Andriy Kryshtafovych, Torsten Schwede, and Anna Tramontano. Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins: Structure, Function, and Bioinformatics, 86:7–15, 2018

  27. [27]

    ProteinNet: a standardized data set for machine learning of protein structure

    Mohammed AlQuraishi. ProteinNet: a standardized data set for machine learning of protein structure. bioRxiv, 2019

  28. [28]

    Semi-supervised protein classification using cluster kernels

    Jason Weston, Dengyong Zhou, André Elisseeff, William S Noble, and Christina S Leslie. Semi-supervised protein classification using cluster kernels. In Advances in neural information processing systems, pages 595–602, 2004

  29. [29]

    Prediction of protein function from networks

    Hyunjung Shin, Koji Tsuda, B Schölkopf, A Zien, et al. Prediction of protein function from networks. In Semi-supervised learning, pages 361–376. MIT press, 2006

  30. [30]

    Modeling the Language of Life - Deep Learning Protein Sequences

    Michael Heinzinger, Ahmed Elnaggar, Yu Wang, Christian 4 Dallago, Dmitrii Nachaev, Florian Matthes, and & Burkhard Rost. Modeling the Language of Life - Deep Learning Protein Sequences. bioRxiv, 2019

  31. [31]

    Lawrence Zitnick, Jerry Ma, and Rob Fergus

    Alexander Rives, Siddharth Goyal, Joshua Meier, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, 2019

  32. [32]

    Riesselman, John B

    Adam J. Riesselman, John B. Ingraham, and Debora S. Marks. Deep generative models of genetic variation capture the effects of mutations. Nature Methods, 15(10):816–822, 2018

  33. [33]

    Learned protein embeddings for machine learning

    Kevin K Yang, Zachary Wu, Claire N Bedbrook, and Frances H Arnold. Learned protein embeddings for machine learning. Bioinformatics, 34(15):2642–2648, 2018

  34. [34]

    SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537, 2019

  35. [35]

    The Pfam protein families database in 2019

    Sara El-Gebali, Jaina Mistry, Alex Bateman, Sean R Eddy, Aurélien Luciani, Simon C Potter, Matloob Qureshi, Lorna J Richardson, Gustavo A Salazar, Alfredo Smart, Erik L L Sonnhammer, Layla Hirsh, Lisanna Paladin, Damiano Piovesan, Silvio C E Tosatto, and Robert D Finn. The Pfam protein families database in 2019. Nucleic Acids Research, 47(D1):D427–D432, 2019. 11

  36. [36]

    Netsurfp-2.0: Improved prediction of protein structural features by integrated deep learning

    Michael Schantz Klausen, Martin Closter Jespersen, Henrik Nielsen, Kamilla Kjaergaard Jensen, Vanessa Isabell Jurtz, Casper Kaae Soenderby, Morten Otto Alexander Sommer, Ole Winther, Morten Nielsen, Bent Petersen, et al. Netsurfp-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins: Structure, Function, and Bioinfo...

  37. [37]

    Alexey Drozdetskiy, Christian Cole, James Procter, and Geoffrey J. Barton. JPred4: a protein secondary structure prediction server. Nucleic Acids Research, 43(W1):W389–W394, 2015

  38. [38]

    Evaluation and improvement of multiple sequence methods for protein secondary structure prediction

    James A Cuff and Geoffrey J Barton. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics, 34(4):508–519, 1999

  39. [39]

    Kim, Frank DiMaio, Ray Yu-Ruei Wang, Yifan Song, and David Baker

    David E. Kim, Frank DiMaio, Ray Yu-Ruei Wang, Yifan Song, and David Baker. One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins: Structure, Function, and Bioinformatics, 82:208–218, 2014

  40. [40]

    Deepsf: deep convolutional neural network for mapping protein sequences to folds

    Jie Hou, Badri Adhikari, and Jianlin Cheng. Deepsf: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics, 34(8):1295–1303, 2017

  41. [41]

    Strategies and molecular tools to fight antimicrobial resistance: resistome, transcriptome, and antimicrobial peptides

    Leticia Stephan Tavares, Carolina dos Santos Fernandes da Silva, Vinicius Carius Souza, Vânia Lúcia da Silva, Cláudio Galuppo Diniz, and Marcelo De Oliveira Santos. Strategies and molecular tools to fight antimicrobial resistance: resistome, transcriptome, and antimicrobial peptides. Frontiers in microbiology, 4:412, 2013

  42. [42]

    Casx enzymes comprise a distinct family of rna-guided genome editors

    Jun-Jie Liu, Natalia Orlova, Benjamin L Oakes, Enbo Ma, Hannah B Spinner, Katherine LM Baney, Jonathan Chuck, Dan Tan, Gavin J Knott, Lucas B Harrington, et al. Casx enzymes comprise a distinct family of rna-guided genome editors. Nature, 566(7743):218, 2019

  43. [43]

    Local fitness landscape of the green fluorescent protein

    Karen S Sarkisyan, Dmitry A Bolotin, Margarita V Meer, Dinara R Usmanova, Alexander S Mishin, George V Sharonov, Dmitry N Ivankov, Nina G Bozhanova, Mikhail S Baranov, Onuralp Soylemez, et al. Local fitness landscape of the green fluorescent protein. Nature, 533(7603):397, 2016

  44. [44]

    Machine learning-guided directed evolution for protein engineering

    Kevin K Yang, Zachary Wu, and Frances H Arnold. Machine learning in protein engineering. arXiv preprint arXiv:1811.10775, 2018

  45. [45]

    Global analysis of protein folding using massively parallel design, synthesis, and testing

    Gabriel J Rocklin, Tamuka M Chidyausiku, Inna Goreshnik, Alex Ford, Scott Houliston, Alexander Lemak, Lauren Carter, Rashmi Ravichandran, Vikram K Mulligan, Aaron Chevalier, et al. Global analysis of protein folding using massively parallel design, synthesis, and testing. Science, 357(6347):168–175, 2017

  46. [46]

    Recur- rent neural network based language model

    Tomáš Mikolov, Martin Karafiát, Lukáš Burget, JanˇCernock`y, and Sanjeev Khudanpur. Recur- rent neural network based language model. In Eleventh annual conference of the international speech communication association, 2010

  47. [47]

    DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

    Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Journal of Machine Learning Research, 2013

  48. [48]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  49. [49]

    Attention is All you Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In I Guyon, U V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, and R Garnett, editors,Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017

  50. [50]

    Dilated residual networks

    Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  51. [51]

    Multiplicative LSTM for sequence modelling

    Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959, 2016. 12

  52. [52]

    Assessment of contact predictions in casp12: Co-evolution and deep learning coming of age

    Joerg Schaarschmidt, Bohdan Monastyrskyy, Andriy Kryshtafovych, and Alexandre MJJ Bonvin. Assessment of contact predictions in casp12: Co-evolution and deep learning coming of age. Proteins: Structure, Function, and Bioinformatics, 86:51–66, 2018

  53. [53]

    Protein contact prediction by integrat- ing joint evolutionary coupling analysis and supervised learning

    Jianzhu Ma, Sheng Wang, Zhiyong Wang, and Jinbo Xu. Protein contact prediction by integrat- ing joint evolutionary coupling analysis and supervised learning. Bioinformatics, 31(21):3506– 3513, 2015

  54. [54]

    Critical assessment of methods of protein structure prediction (casp)—round xii

    John Moult, Krzysztof Fidelis, Andriy Kryshtafovych, Torsten Schwede, and Anna Tramontano. Critical assessment of methods of protein structure prediction (casp)—round xii. Proteins: Structure, Function, and Bioinformatics, 86:7–15, 2018

  55. [55]

    Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in bioinformatics, 19(3):482–494, 2016

    Yuedong Yang, Jianzhao Gao, Jihua Wang, Rhys Heffernan, Jack Hanson, Kuldip Paliwal, and Yaoqi Zhou. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in bioinformatics, 19(3):482–494, 2016

  56. [56]

    The cost and value of three-dimensional protein structure

    Raymond C Stevens. The cost and value of three-dimensional protein structure. Drug Discovery World, 4(3):35–48, 2003

  57. [57]

    Scope: Structural classification of proteins—extended, integrating scop and astral data and classification of new structures

    Naomi K Fox, Steven E Brenner, and John-Marc Chandonia. Scope: Structural classification of proteins—extended, integrating scop and astral data and classification of new structures. Nucleic acids research, 42(D1):D304–D309, 2013

  58. [58]

    The protein data bank.Nucleic acids research, 28(1):235–242, 2000

    Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank.Nucleic acids research, 28(1):235–242, 2000

  59. [59]

    Major new microbial groups expand diversity and alter our understanding of the tree of life

    Cindy J Castelle and Jillian F Banfield. Major new microbial groups expand diversity and alter our understanding of the tree of life. Cell, 172(6):1181–1197, 2018

  60. [60]

    CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations

    Stefan Seemayer, Markus Gruber, and Johannes Söding. CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics, 30(21):3128– 3130, 2014

  61. [61]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sublinear Memory Cost. arXiv, 2016

  62. [62]

    A series of pdb related databases for everyday needs

    Robbie P Joosten, Tim AH Te Beek, Elmar Krieger, Maarten L Hekkelman, Rob WW Hooft, Reinhard Schneider, Chris Sander, and Gert Vriend. A series of pdb related databases for everyday needs. Nucleic acids research, 39(suppl_1):D411–D419, 2010. 13 Table S1: Dataset sizes Task Train Valid Test Language Modeling 32,207,059 N/A 2,147,130 (Random-split) / 44,314...

  63. [63]

    All proteins of a given fold are further categorized into related superfamilies

    of hierarchically classified protein domains. All proteins of a given fold are further categorized into related superfamilies. Entire superfamilies are held out from the training set, allowing us to evaluate how the model generalizes across evolutionary distance when structure is preserved. (Labeling) Each fold is annotated from the structure of the sequen...

  64. [64]

    P@K” are precision for the top K contacts, where all contacts are sorted from highest confidence to lowest confidence. Note thatL is the length of the protein, so “P@L/2

    for details). We chose holdout clans and families in pairs, where a clan of novel function is held out together with a family that is similar in sequence but different evolutionarily or functionally. This 16 Table S2: Detailed secondary structure results Three-Way Accuracy (Q3) Eight-Way Accuracy (Q8) CB513 CASP12 TS115 CB513 CASP12 TS115 No Pretrain Tran...