pith. sign in

arxiv: 2605.25242 · v1 · pith:SXR7SSYPnew · submitted 2026-05-24 · 🧬 q-bio.GN

C3P: Contrastive promoter-protein pretraining yields representations capturing bacterial gene regulation

Pith reviewed 2026-06-29 23:02 UTC · model grok-4.3

classification 🧬 q-bio.GN
keywords bacterial gene regulationcontrastive pretrainingpromoter representationsprotein language modelszero-shot retrievalgenome language modelsregulatory sequencesco-regulated genes
0
0 comments X

The pith

Contrastive alignment of bacterial promoters with their proteins yields representations that capture gene regulation better than genome language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that contrastive pretraining, by aligning promoter sequences to the embeddings of their corresponding proteins, produces more effective representations for bacterial regulatory sequences than standard genome language models trained on sequence reconstruction. This approach addresses the challenge of noisy and rapidly evolving regulatory DNA by using protein information as a supervisory signal. A sympathetic reader would care because it demonstrates improved performance in predicting regulatory annotations and enables zero-shot identification of co-regulated genes directly from genomes, potentially allowing gene regulation decoding for many bacteria without additional experiments. Scaling analysis indicates the method reaches strong results at lower computational cost than leading alternatives.

Core claim

By training on 88 million bacterial promoter-protein pairs to align promoters with protein embeddings from protein language models, C3P produces promoter representations that achieve multi-fold improvements in predicting curated regulatory annotations and enable significant zero-shot co-regulated gene retrieval, unlike gLMs.

What carries the argument

The contrastive loss that aligns promoter embeddings with corresponding protein embeddings, using protein language model representations as the supervisory signal.

If this is right

  • Multi-fold improvement over gLMs for inference of regulatory annotations
  • Significant zero-shot gains in co-regulated gene retrieval compared to random baseline and gLMs
  • Achieves strong performance at a fraction of the training cost of leading gLMs
  • Further improvement is possible with additional scaling of the training data

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may allow inference of gene regulation in uncultured bacteria solely from their genome sequences.
  • Protein embeddings could serve as a general supervisory signal for learning representations of other regulatory sequence types.
  • Zero-shot retrieval performance could be validated against experimentally confirmed co-regulation datasets in well-studied model bacteria.

Load-bearing premise

Protein embeddings encode information relevant to the regulatory function of the promoters that control their expression.

What would settle it

If C3P representations do not show improved performance over gLMs when tested on a large independent set of curated bacterial regulatory annotations from a different species.

Figures

Figures reproduced from arXiv: 2605.25242 by Alan M. Moses, Cameron Dufault, Scott Xu.

Figure 1
Figure 1. Figure 1: An overview of (a) C3P dataset creation and (b) architecture and training. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of C3P versus Evo2 and ESM2 per-class performance, as measured with [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of C3P versus Evo2 performance measured by top-1 full match accuracy and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: UMAP plots of [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of the performance of C3P and each gLM baseline against their random [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Scaling of regulon classification performance (average over 102 regulons) and zero-shot [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of the number of genes annotated for each of the 102 regulons and 6 sigmulons [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Scaling of C3P and each gLM baseline. Average zero-shot co-regulated gene retrieval top-1 [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Spearman’s correlation of predictions from ridge regression models trained with embeddings [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
read the original abstract

Despite the increasing scale of genome language models (gLMs), their ability to decode the function of regulatory sequences remains unclear. gLM pretraining relies on sequence reconstruction, which may struggle due to the noisy, rapidly evolving nature of regulatory DNA. Self-supervised contrastive approaches provide a promising alternative. Inspired by language-image architectures like CLIP, we introduce contrastive promoter-protein pretraining (C3P). By learning to align promoters to their corresponding proteins, we leverage the rich representations of proteins learned by protein language models as supervisory signal for the learning of promoter representations. After training on 88 million bacterial promoter-protein pairs, we evaluate the predictive power of C3P-learned promoter representations for inference of curated regulatory annotations, finding multi-fold improvement over leading gLMs. We also introduce zero-shot co-regulated gene retrieval, the ability to find co-regulated genes in a genome using no experimental data. We find that compared to a randomly initialized baseline, C3P training consistently provides significant zero-shot performance gains, unlike gLMs. Scaling analysis reveals the potential for further improvement as well as the efficiency of C3P, which achieved strong performance at a fraction of the training cost of leading gLMs. In addition to demonstrating that C3P training is effective for learning representations of bacterial regulatory sequences, our strong zero-shot co-regulated gene retrieval performance suggests the possibility of decoding gene regulation for millions of bacteria from their genomes alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces C3P, a contrastive pretraining method inspired by CLIP that aligns bacterial promoters with their corresponding protein embeddings from protein language models. Trained on 88 million promoter-protein pairs, the resulting promoter representations are shown to outperform leading genome language models in inferring curated regulatory annotations and achieve significant gains in a novel zero-shot co-regulated gene retrieval task, while being more efficient.

Significance. If the empirical gains are robust, this work provides a scalable and data-efficient approach to learning regulatory sequence representations by leveraging protein supervision, potentially enabling better decoding of gene regulation across bacterial genomes without extensive experimental data. The introduction of the zero-shot retrieval task is a notable contribution.

major comments (2)
  1. [Abstract] Abstract: The claim of 'multi-fold improvement over leading gLMs' for inference of curated regulatory annotations lacks any specification of the exact gLM baselines, statistical tests, data splits, or ablation controls, preventing assessment of whether post-hoc choices or dataset biases affect the central claim.
  2. [Results] Results (zero-shot retrieval and scaling analysis): The reported gains are attributed to the contrastive alignment supplying regulatory signal from protein embeddings, but no direct controls (e.g., shuffled pairs or protein-embedding ablations) are described to rule out that gains arise instead from architecture scale, data volume, or objective differences; this is load-bearing for the claim that protein embeddings encode promoter-specific features.
minor comments (1)
  1. [Abstract] Abstract: The efficiency claim ('fraction of the training cost of leading gLMs') is stated without quantitative comparison of FLOPs, wall-clock time, or parameter counts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight opportunities to improve clarity and strengthen the attribution of results. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim of 'multi-fold improvement over leading gLMs' for inference of curated regulatory annotations lacks any specification of the exact gLM baselines, statistical tests, data splits, or ablation controls, preventing assessment of whether post-hoc choices or dataset biases affect the central claim.

    Authors: We agree that greater specificity in the abstract will aid assessment of the claims. The full manuscript already details the gLM baselines (e.g., specific models compared in Results), statistical tests (paired t-tests with reported p-values), chromosome-level data splits to prevent leakage, and ablation controls in the Supplementary Information. In revision we will update the abstract to explicitly name the baselines, reference the tests and splits, and point to the ablations, ensuring the central claim is presented with appropriate context. revision: yes

  2. Referee: [Results] Results (zero-shot retrieval and scaling analysis): The reported gains are attributed to the contrastive alignment supplying regulatory signal from protein embeddings, but no direct controls (e.g., shuffled pairs or protein-embedding ablations) are described to rule out that gains arise instead from architecture scale, data volume, or objective differences; this is load-bearing for the claim that protein embeddings encode promoter-specific features.

    Authors: We acknowledge the value of direct controls to isolate the role of contrastive alignment with protein embeddings. Existing comparisons to a randomly initialized baseline and to gLMs provide partial controls for architecture, data volume, and objective, and the zero-shot gains over random initialization support the contribution of C3P training. However, to more rigorously rule out alternative explanations, we will add the requested experiments during revision: training on shuffled promoter-protein pairs and ablations replacing protein embeddings with random or non-informative vectors. These will be reported in the zero-shot retrieval and scaling sections to directly test whether protein-specific features drive the observed gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method uses external pairs and pLMs with independent downstream evaluations

full rationale

The paper defines C3P as standard contrastive alignment of promoters to protein embeddings from external pLMs on 88M pairs, then reports empirical gains on held-out regulatory annotation inference and zero-shot co-regulated gene retrieval. No equations reduce these gains to quantities defined by the training objective itself. No self-citation chains or fitted inputs are invoked as load-bearing for the central claims. The derivation remains self-contained against external benchmarks and does not match any enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the utility of protein LM embeddings as a supervisory signal for promoter function and on the quality of the 88M promoter-protein pairings; no free parameters or invented entities are named in the abstract.

free parameters (1)
  • contrastive temperature and model hyperparameters
    Standard ML training choices whose specific values are not reported in the abstract but affect the learned representations.
axioms (1)
  • domain assumption Protein language model representations encode information relevant to promoter regulatory function
    Invoked when using protein embeddings to supervise promoter learning.

pith-pipeline@v0.9.1-grok · 5794 in / 1113 out tokens · 28947 ms · 2026-06-29T23:02:19.807254+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

58 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Joint-Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self-Supervised Learning

    Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, and Randall Balestriero. Joint-Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self-Supervised Learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, October 2025

  2. [2]

    Ledsam, Agnieszka Grabska- Barwinska, Kyle R

    Žiga Avsec, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka Grabska- Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet Kohli, and David R. Kelley. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10):1196–1203, October 2021

  3. [3]

    Žiga Avsec, Natasha Latysheva, Jun Cheng, Guido Novati, Kyle R. Taylor, Tom Ward, Clare Bycroft, Lauren Nicolaisen, Eirini Arvaniti, Joshua Pan, Raina Thomas, Vincent Dutordoir, Matteo Perino, Soham De, Alexander Karollus, Adam Gayoso, Toby Sargeant, Anne Mottram, Lai Hong Wong, Pavol Drotár, Adam Kosiorek, Andrew Senior, Richard Tanburn, Taylor Applebaum...

  4. [4]

    Baumgart, Ji Eun Lee, Asaf Salamov, David J

    Leo A. Baumgart, Ji Eun Lee, Asaf Salamov, David J. Dilworth, Hyunsoo Na, Matthew Mingay, Matthew J. Blow, Yu Zhang, Yuko Yoshinaga, Chris G. Daum, and Ronan C. O’Malley. Persistence and plasticity in bacterial gene regulation.Nature Methods, 18(12):1499–1505, December 2021

  5. [5]

    Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwang- gyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A

    Garyk Brixi, Matthew G. Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwang- gyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Jonathan C. Schmok, Ali Taghibakhshi, Anton V orontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguy...

  6. [6]

    Exploring Simple Siamese Representation Learning, November

    Xinlei Chen and Kaiming He. Exploring Simple Siamese Representation Learning, November

  7. [7]

    arXiv:2011.10566 [cs]

  8. [8]

    Consens, Cameron Dufault, Michael Wainberg, Duncan Forster, Mehran Karimzadeh, Hani Goodarzi, Fabian J

    Micaela E. Consens, Cameron Dufault, Michael Wainberg, Duncan Forster, Mehran Karimzadeh, Hani Goodarzi, Fabian J. Theis, Alan Moses, and Bo Wang. Transformers and genome language models.Nature Machine Intelligence, 7(3):346–362, March 2025

  9. [9]

    The OMG dataset: An open metagenomic corpus for mixed-modality genomic language modeling

    Andre Cornman, Jacob West-Roberts, Antonio Pedro Camargo, Simon Roux, Martin Bera- cochea, Milot Mirdita, Sergey Ovchinnikov, and Yunha Hwang. The OMG dataset: An open metagenomic corpus for mixed-modality genomic language modeling. InThe Thirteenth International Conference on Learning Representations, 2025

  10. [10]

    Eisen, Paul T

    Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns.Proceedings of the National Academy of Sciences, 95(25):14863–14868, December 1998

  11. [11]

    Learning Visual Representations via Language-Guided Sampling

    Mohamed El Banani, Karan Desai, and Justin Johnson. Learning Visual Representations via Language-Guided Sampling. InCVPR, 2023

  12. [12]

    Frey, Leo J

    Philip Fradkin, Ruian “Ian” Shi, Taykhoom Dalal, Keren Isaev, Brendan J. Frey, Leo J. Lee, Quaid Morris, and Bo Wang. Orthrus: toward evolutionary and functional RNA foundation models.Nature Methods, pages 1–11, April 2026. 11

  13. [13]

    NCBI RefSeq: reference sequence standards through 25 years of curation and annotation.Nucleic Acids Research, 53(D1):D243–D257, January 2025

    Tamara Goldfarb, Vamsi K Kodali, Shashikant Pujar, Vyacheslav Brover, Barbara Robbertse, Catherine M Farrell, Dong-Ha Oh, Alexander Astashyn, Olga Ermolaeva, Diana Haddad, Wratko Hlavina, Jinna Hoffman, John D Jackson, Vinita S Joardar, David Kristensen, Patrick Masterson, Kelly M McGarvey, Richard McVeigh, Eyal Mozes, Michael R Murphy, Susan S Schafer, A...

  14. [14]

    Halper, Daniel P

    Ayaan Hossain, Eriberto Lopez, Sean M. Halper, Daniel P. Cetnar, Alexander C. Reis, Devin Strickland, Eric Klavins, and Howard M. Salis. Automated design of thousands of nonrepeti- tive parts for engineering stable genetic systems.Nature Biotechnology, 38(12):1466–1475, December 2020

  15. [15]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InProceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 4904–4916. P...

  16. [16]

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ron- neberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reim...

  17. [17]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020. arXiv:2001.08361 [cs]

  18. [18]

    Flexible comparative genomics of prokaryotic transcriptional regulatory networks.BMC Genomics, 21(5):466, December 2020

    Sefa Kılıç, Miquel Sánchez-Osuna, Antonio Collado-Padilla, Jordi Barbé, and Ivan Erill. Flexible comparative genomics of prokaryotic transcriptional regulatory networks.BMC Genomics, 21(5):466, December 2020

  19. [19]

    LaFleur, Ayaan Hossain, and Howard M

    Travis L. LaFleur, Ayaan Hossain, and Howard M. Salis. Automated model-predictive design of synthetic promoters to control transcriptional profiles in bacteria.Nature Communications, 13(1):5159, September 2022

  20. [20]

    Harnessing a unified multi-modal sequence modeling to unveil protein-dna interdependency.bioRxiv, pages 2025–02, 2025

    Mingchen Li, Yuchen Ren, Peng Ye, Jiabei Cheng, Xinzhu Ma, Yuchen Cai, Wanli Ouyang, Bozitao Zhong, Banghao Wu, Nanqing Dong, et al. Harnessing a unified multi-modal sequence modeling to unveil protein-dna interdependency.bioRxiv, pages 2025–02, 2025

  21. [21]

    Evolutionary-scale prediction of atomic- level protein structure with a language model.Science, 379(6637):1123–1130, March 2023

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic- level protein structure with a language model.Science, 379(6637):1123–1130, March 2023

  22. [22]

    Johannes Linder, Divyanshi Srivastava, Han Yuan, Vikram Agarwal, and David R. Kelley. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nature Genetics, 57(4):949–961, April 2025

  23. [23]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InInternational Conference on Learning Representations, 2019

  24. [24]

    A census-based estimate of Earth’s bacterial and archaeal diversity.PLOS Biology, 17(2):e3000106, February 2019

    Stilianos Louca, Florent Mazel, Michael Doebeli, and Laura Wegener Parfrey. A census-based estimate of Earth’s bacterial and archaeal diversity.PLOS Biology, 17(2):e3000106, February 2019. 12

  25. [25]

    Lu, Alex X

    Amy X. Lu, Alex X. Lu, and Alan Moses. Evolution Is All You Need: Phylogenetic Augmenta- tion for Contrastive Learning, December 2020. arXiv:2012.13475 [q-bio]

  26. [26]

    Lu, Haoran Zhang, Marzyeh Ghassemi, and Alan Moses

    Amy X. Lu, Haoran Zhang, Marzyeh Ghassemi, and Alan Moses. Self-supervised contrastive learning of protein representations by mutual information maximization.bioRxiv, 2020

  27. [27]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, September 2020. arXiv:1802.03426 [stat]

  28. [28]

    Moses, Jason E

    Alan M. Moses, Jason E. Stajich, Audrey P. Gasch, and David A. Knowles. Inferring fungal cis- regulatory networks from genome sequences via unsupervised and interpretable representation learning.Genetics, page iyaf209, September 2025

  29. [29]

    A Metric Learning Reality Check

    Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A Metric Learning Reality Check. In Computer Vision – ECCV 2020, pages 681–699, Cham, 2020. Springer International Publishing

  30. [30]

    Durrant, Brian Kang, Dhruva Katrekar, David B

    Eric Nguyen, Michael Poli, Matthew G. Durrant, Brian Kang, Dhruva Katrekar, David B. Li, Liam J. Bartie, Armin W. Thomas, Samuel H. King, Garyk Brixi, Jeremy Sullivan, Madelena Y . Ng, Ashley Lewis, Aaron Lou, Stefano Ermon, Stephen A. Baccus, Tina Hernandez-Boussard, Christopher Ré, Patrick D. Hsu, and Brian L. Hie. Sequence modeling and design from mole...

  31. [31]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation Learning with Contrastive Predictive Coding, January 2019. arXiv:1807.03748 [cs]

  32. [32]

    O’Leary, Eric Cox, J

    Nuala A. O’Leary, Eric Cox, J. Bradley Holmes, W. Ray Anderson, Robert Falk, Vichet Hem, Mirian T. N. Tsuchiya, Gregory D. Schuler, Xuan Zhang, John Torcivia, Anne Ketter, Laurie Breen, Jonathan Cothran, Hena Bajwa, Jovany Tinne, Peter A. Meric, Wratko Hlavina, and Valerie A. Schneider. Exploring and retrieving sequence and metadata for species across the...

  33. [33]

    Pedregosa, G

    F. Pedregosa, G. Varoquaux, A. Gramfort, V . Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V . Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.Journal of Machine Learning Research, 12:2825–2830, 2011

  34. [34]

    Stripedhyena: Moving beyond transformers with hybrid signal processing models.GitHub repository, 12, 2023

    Michael Poli, Jue Wang, Stefano Massaroli, Jeffrey Quesnelle, Ryan Carlow, Eric Nguyen, and Armin Thomas. Stripedhyena: Moving beyond transformers with hybrid signal processing models.GitHub repository, 12, 2023

  35. [35]

    Stephen R. Quake. The cellular dogma.Cell, 187(23):6421–6423, November 2024

  36. [36]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceed- ings of the 38th International Conference on Machine Learning, volume 139 ofProceedings...

  37. [37]

    Robust and Decomposable Average Precision for Image Retrieval

    Elias Ramzi, Nicolas Thome, Clément Rambour, Nicolas Audebert, and Xavier Bitot. Robust and Decomposable Average Precision for Image Retrieval. InAdvances in Neural Information Processing Systems, volume 34, pages 23569–23581. Curran Associates, Inc., 2021

  38. [38]

    Lawrence Zitnick, Jerry Ma, and Rob Fergus

    Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.Proceedings of the National Academy of Sciences, 118(15):e2016239118, April 2021

  39. [39]

    iModulonDB: a knowledgebase of microbial transcriptional regulation derived from machine learning.Nucleic Acids Research, 49(D1):D112–D120, January 2021

    Kevin Rychel, Katherine Decker, Anand V Sastry, Patrick V Phaneuf, Saugat Poudel, and Bernhard O Palsson. iModulonDB: a knowledgebase of microbial transcriptional regulation derived from machine learning.Nucleic Acids Research, 49(D1):D112–D120, January 2021. 13

  40. [40]

    RegulonDB v12.0: a comprehensive resource of transcriptional regulation in E

    Heladia Salgado, Socorro Gama-Castro, Paloma Lara, Citlalli Mejia-Almonte, Gabriel Alarcón- Carranza, Andrés G López-Almazo, Felipe Betancourt-Figueroa, Pablo Peña-Loredo, Shirley Alquicira-Hernández, Daniela Ledezma-Tejeida, Lizeth Arizmendi-Zagal, Francisco Mendez- Hernandez, Ana K Diaz-Gomez, Elizabeth Ochoa-Praxedis, Luis J Muñiz-Rascado, Jair S Garcí...

  41. [41]

    Sangster, Cameron Dufault, Haoning Qu, Denise Le, Julie D

    Ami G. Sangster, Cameron Dufault, Haoning Qu, Denise Le, Julie D. Forman-Kay, and Alan M. Moses. Zero-shot segmentation using embeddings from a protein language model identifies functional regions in the human proteome.PLOS Computational Biology, 21(11):e1012929, November 2025

  42. [42]

    Sastry, Ye Gao, Richard Szubin, Ying Hefner, Sibei Xu, Donghyuk Kim, Kumari Sonal Choudhary, Laurence Yang, Zachary A

    Anand V . Sastry, Ye Gao, Richard Szubin, Ying Hefner, Sibei Xu, Donghyuk Kim, Kumari Sonal Choudhary, Laurence Yang, Zachary A. King, and Bernhard O. Palsson. The Escherichia coli transcriptome mostly consists of independently regulated modules.Nature Communications, 10(1):5536, December 2019

  43. [43]

    GenBank 2025 update.Nucleic Acids Research, 53(D1):D56–D61, January 2025

    Eric W Sayers, Mark Cavanaugh, Linda Frisse, Kim D Pruitt, Valerie A Schneider, Beverly A Underwood, Linda Yankie, and Ilene Karsch-Mizrachi. GenBank 2025 update.Nucleic Acids Research, 53(D1):D56–D61, January 2025

  44. [44]

    MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.Nature Biotechnology, 35(11):1026–1028, November 2017

    Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.Nature Biotechnology, 35(11):1026–1028, November 2017

  45. [45]

    RoFormer: Enhanced Transformer with Rotary Position Embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2021. arXiv:2104.09864v1 [cs]

  46. [46]

    Ziqi Tang, Nirali Somia, Yiyang Yu, and Peter K. Koo. Evaluating the representational power of pre-trained DNA language models for regulatory genomics.Genome Biology, 26(1):203, July 2025

  47. [47]

    Esm cambrian: Revealing the mysteries of proteins with unsupervised learning

    ESM Team et al. Esm cambrian: Revealing the mysteries of proteins with unsupervised learning. EvolutionaryScale Website, 2024

  48. [48]

    Tripp, Kimberly D

    Guillaume Urtecho, Arielle D. Tripp, Kimberly D. Insigne, Hwangbeom Kim, and Sriram Kosuri. Systematic Dissection of Sequence Elements Controlling σ70 Promoters Using a Genomically Encoded Multiplexed Reporter Assay in Escherichia coli.Biochemistry, 58(11):1539–1551, March 2019

  49. [49]

    Attention is All you Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

  50. [50]

    Tokenization to transfer: Do genomic foundation models learn good representations? InThe Fourteenth International Conference on Learning Representations, 2026

    Kirill Vishniakov, Karthik Viswanathan, Aleksandr Medvedev, Praveenkumar Kanithi, Marco AF Pimentel, Ronnie Rajan, and Shadab Khan. Tokenization to transfer: Do genomic foundation models learn good representations? InThe Fourteenth International Conference on Learning Representations, 2026

  51. [51]

    Weirauch and Timothy R

    Matthew T. Weirauch and Timothy R. Hughes. Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same.Trends in Genetics, 26(2):66–74, February 2010

  52. [52]

    Design prokaryotic cis-regulatory elements using language model.Nucleic Acids Research, 54(4):gkag122, February 2026

    Yan Xia, Jinyuan Sun, Xiaowen Du, Zeyu Liang, Xin Wu, Wenyu Shi, Bin Shao, Shuyuan Guo, and Yi-Xin Huo. Design prokaryotic cis-regulatory elements using language model.Nucleic Acids Research, 54(4):gkag122, February 2026

  53. [53]

    DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification.BMC Bioinformatics, 25(1):328, October 2024

    Minghao Yang, Zehua Wang, Zizhuo Yan, Wenxiang Wang, Qian Zhu, and Changlong Jin. DNASimCLR: a contrastive learning-based deep learning approach for gene sequence data classification.BMC Bioinformatics, 25(1):328, October 2024. 14

  54. [54]

    Yu, Winnie L

    Timothy C. Yu, Winnie L. Liu, Marcia S. Brinck, Jessica E. Davis, Jeremy Shek, Grace Bower, Tal Einav, Kimberly D. Insigne, Rob Phillips, Sriram Kosuri, and Guillaume Urtecho. Multi- plexed characterization of rationally designed promoter architectures deconstructs combinatorial logic for IPTG-inducible systems.Nature Communications, 12(1):325, January 2021

  55. [55]

    Lit: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18102–18112, 2022

  56. [56]

    Wayment-Steele, Garyk Brixi, Haobo Wang, Dorothee Kern, and Sergey Ovchinnikov

    Zhidian Zhang, Hannah K. Wayment-Steele, Garyk Brixi, Haobo Wang, Dorothee Kern, and Sergey Ovchinnikov. Protein language models learn evolutionary statistics of interacting sequence motifs.Proceedings of the National Academy of Sciences, 121(45):e2406285121, November 2024. 15 A Dataset creation A.1 Genome sampling and diversity Table 3: Summary of taxono...

  57. [57]

    In their evaluations the authors used the final layer embeddings of the Evo 7B model, averaged over the sequence length and having a dimensionality of 4,096

    architecture, an efficient transformer alternative. In their evaluations the authors used the final layer embeddings of the Evo 7B model, averaged over the sequence length and having a dimensionality of 4,096. We extract embeddings of promoter sequences from Evo in the same manner.Evo2 (7B)

  58. [58]

    2.4T tokens) with sequences from all domains of life (downloaded from https://huggingface.co/arcinstitute/evo2_7b, apache-2.0 license)

    uses a similar training regime and architecture to Evo (7B), but was trained on a significantly larger dataset (300B vs. 2.4T tokens) with sequences from all domains of life (downloaded from https://huggingface.co/arcinstitute/evo2_7b, apache-2.0 license). Rather than extracting embeddings from the final hidden layer, we extract Evo2 embeddings from layer...