pith. machine review for the scientific record. sign in

arxiv: 2605.06720 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Conditional generation of antibody sequences with classifier-guided germline-absorbing discrete diffusion

Authors on Pith no claims yet

Pith reviewed 2026-05-11 00:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords antibody sequence designdiscrete diffusiongermline absorbing statesomatic hypermutationconditional generationclassifier guidanceprotein language models
0
0 comments X

The pith

Germline-absorbing discrete diffusion lets antibody models focus on somatic mutations instead of memorizing germline sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard protein language models for antibodies tend to memorize common germline sequences rather than learning the changes introduced by somatic hypermutation. By changing the discrete diffusion process so that the germline sequence acts as the absorbing state, the model is forced to learn only the trajectory of biological variation from that starting point. This change raises accuracy on predicting non-germline residues from 26 percent to 46 percent. The same trained model then supports classifier-guided sampling that produces antibody sequences with better hydrophobicity or predicted binding affinity, with a stronger balance between property adherence and sequence quality than prior gradient-based sampling methods.

Core claim

Germline absorbing diffusion replaces the usual mask with the germline sequence as the absorbing state in the discrete diffusion noise process. This inductive bias keeps the learned distribution focused on the path from germline to the observed antibody sequence and excludes statistics from genetic variation and V(D)J recombination. As a result the model captures somatic hypermutation more cleanly, yielding the reported accuracy gain on non-germline residues and an improved tradeoff on conditional generation tasks for hydrophobicity and binding affinity.

What carries the argument

Germline-absorbing discrete diffusion: a change to the noise schedule in which the germline sequence, rather than a masked token, serves as the absorbing state.

If this is right

  • Non-germline residue prediction accuracy rises from 26 percent to 46 percent and approaches the limit set by natural biological variability.
  • Classifier-guided sampling produces antibody sequences with measurably better hydrophobicity and predicted binding affinity.
  • The generated sequences maintain higher sample quality for a given level of property adherence than sequences obtained from EvoProtGrad.
  • The same diffusion model can be conditioned on any off-the-shelf classifier without retraining the underlying language model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same absorbing-state idea could be tested on other protein families that have a clear ancestral or germline-like reference sequence.
  • Pairing the diffusion model with structure predictors might allow end-to-end design that optimizes both sequence properties and structural developability.
  • If the accuracy gain holds on larger and more diverse antibody datasets, the approach could reduce the number of sequences that must be synthesized and tested in the lab.

Load-bearing premise

That setting the germline sequence as the absorbing state truly removes genetic variation and V(D)J recombination effects from what the model learns while still allowing it to represent real somatic changes.

What would settle it

A replication experiment on held-out antibody sequences in which the germline-absorbing model shows no improvement over standard discrete diffusion in non-germline residue accuracy or in the class-adherence versus quality tradeoff on hydrophobicity or affinity tasks.

Figures

Figures reproduced from arXiv: 2605.06720 by Justin Sanders, Kemal Sonmez, Lan Guo, Luca Giancardo, Melih Yilmaz, Nina Cheng, Yue Zhao.

Figure 1
Figure 1. Figure 1: Discrete diffusion antibody protein language model with germline absorbing state (A) Our model is trained to denoise germline sequences into observed antibody sequences using the score entropy discrete diffusion (SEDD) framework. (B) Once trained our model allows for de novo generation and directed evolution of antibody sequences conditioned on arbitrary classifiers. germline bias. In downstream in silico … view at source ↗
Figure 2
Figure 2. Figure 2: Conditional generation with germline diffusion (A) Tradeoff between class coherence and sample quality when choosing a guidance strength for conditional generation. Antibody sequences conditionally sampled from one of the seven primary V-gene families using a pretrained classifier for guidance. (B) Orthogonal validation using Boltz-2 to predict structure-based binding affinity of sequences from experimenta… view at source ↗
read the original abstract

Antibody therapeutics are among the most successful modern medicines, yet computationally designing antibodies with desirable binding and developability properties remains challenging. While protein language models (pLMs) have emerged as powerful tools for antibody sequence design, existing approaches largely suffer from two key limitations: they predominantly memorize germline sequences rather than modeling biologically meaningful somatic variation, and they offer limited support for flexible classifier-guided conditional generation. We address these challenges through two primary contributions. First, we demonstrate that discrete diffusion fine-tuning achieves strong language modeling performance on antibody sequences while allowing for generation conditioned on any off-the-shelf classifier. Second, we introduce germline absorbing diffusion, a novel modification of the discrete diffusion noise process in which the germline sequence - rather than a masked sequence - serves as the absorbing state. This biologically motivated inductive bias restricts the model to learning the trajectory from germline to observed sequence, effectively excluding genetic variation and V(D)J recombination statistics from the learned distribution and dramatically mitigating germline bias. We show that germline diffusion improves non-germline residue prediction accuracy from 26 percent to 46 percent, approaching the theoretical upper bound set by true biological variability. We then demonstrate the utility of our germline diffusion model on the conditional generation tasks of sampling antibodies with improved hydrophobicity and predicted binding affinity. On both tasks our model shows an improved tradeoff between class adherence and sample quality, significantly outperforming EvoProtGrad, a popular strategy to sample from pLMs with gradient-based discrete Markov Chain Monte Carlo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes germline-absorbing discrete diffusion as a modification to standard discrete diffusion models for antibody sequences. By designating the germline sequence (rather than a mask) as the absorbing state in the forward noise process, the model is intended to learn only the trajectory of somatic hypermutation while excluding V(D)J recombination statistics. The central empirical claims are that this yields a lift in non-germline residue prediction accuracy from 26% to 46% (approaching a theoretical upper bound set by biological variability) and produces superior class-adherence versus sample-quality trade-offs on two classifier-guided conditional generation tasks (hydrophobicity and predicted binding affinity) relative to EvoProtGrad.

Significance. If the reported gains are shown to arise from improved capture of somatic variation rather than task asymmetry, the work would supply a biologically motivated inductive bias that is compatible with any off-the-shelf classifier for conditional sampling. This flexibility and the explicit separation of germline versus somatic statistics constitute a clear methodological contribution to the growing literature on diffusion models for protein design.

major comments (3)
  1. [Results section on non-germline residue prediction] The headline result that germline-absorbing diffusion raises non-germline residue prediction accuracy from 26% to 46% (Abstract and Results section) is load-bearing for the claim of reduced germline bias. The manuscript must explicitly state whether the germline sequence is supplied as the starting point or conditioning input to the denoising network during this evaluation; if it is, the comparison to an unconditional pLM baseline conflates two distinct inference problems and does not isolate the effect of the absorbing-state modification.
  2. [Abstract and corresponding Results paragraph] The statement that 46% 'approaches the theoretical upper bound set by true biological variability' (Abstract) is central to interpreting the magnitude of improvement, yet no equation, dataset, or procedure is given for computing this bound. A concrete description—e.g., how replicate sequences or per-position entropy were used—must be added so readers can verify whether the bound is task-appropriate and whether the model is genuinely close to it.
  3. [Conditional generation experiments subsection] The conditional-generation experiments (hydrophobicity and binding-affinity tasks) claim an 'improved tradeoff between class adherence and sample quality' and statistically significant outperformance of EvoProtGrad. The manuscript must supply the precise classifier architecture, guidance scale schedule, quantitative metrics (with error bars and sample sizes), and the exact definition of 'sample quality' used in the comparison; without these, the practical utility claim cannot be evaluated.
minor comments (2)
  1. [Introduction] The term 'germline bias' is used repeatedly but never given a formal definition or quantitative metric; a short paragraph in the Introduction or Methods would clarify what quantity is being mitigated.
  2. [Figures and captions] Figure captions and axis labels should explicitly indicate whether reported accuracies are per-residue, per-sequence, or averaged over a held-out test set, and whether the same train/test splits are used for all baselines.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below, with revisions made to the relevant sections.

read point-by-point responses
  1. Referee: [Results section on non-germline residue prediction] The headline result that germline-absorbing diffusion raises non-germline residue prediction accuracy from 26% to 46% (Abstract and Results section) is load-bearing for the claim of reduced germline bias. The manuscript must explicitly state whether the germline sequence is supplied as the starting point or conditioning input to the denoising network during this evaluation; if it is, the comparison to an unconditional pLM baseline conflates two distinct inference problems and does not isolate the effect of the absorbing-state modification.

    Authors: We agree that explicit clarification is required. In the germline-absorbing model, the germline sequence is supplied as the absorbing state and thus serves as the starting point for the reverse denoising process during non-germline residue prediction. The unconditional pLM baseline is evaluated without any germline input. While this setup intentionally contrasts a germline-conditioned diffusion process against an unconditional baseline to demonstrate the bias-reduction effect of the absorbing-state modification, we acknowledge the potential for conflation. In the revised manuscript we have added an explicit description of the inference procedure in the Results section and included a new ablation comparing against a standard (non-germline-absorbing) discrete diffusion model conditioned on the germline sequence. This isolates the contribution of the absorbing-state change. revision: yes

  2. Referee: [Abstract and corresponding Results paragraph] The statement that 46% 'approaches the theoretical upper bound set by true biological variability' (Abstract) is central to interpreting the magnitude of improvement, yet no equation, dataset, or procedure is given for computing this bound. A concrete description—e.g., how replicate sequences or per-position entropy were used—must be added so readers can verify whether the bound is task-appropriate and whether the model is genuinely close to it.

    Authors: We thank the referee for highlighting this omission. The upper bound was derived from per-position amino-acid entropy computed across replicate sequences belonging to the same clonal lineage in the training dataset, which captures both sequencing noise and true biological somatic variability. We have added a new paragraph in the Methods section with the exact formula (upper bound = 1 - mean_entropy / log(20)) and the dataset subset used (clonal replicates with at least three sequences per lineage). The revised Abstract and Results now reference this computation and report that the observed 46% accuracy reaches approximately 88% of the estimated bound of 52%. revision: yes

  3. Referee: [Conditional generation experiments subsection] The conditional-generation experiments (hydrophobicity and binding-affinity tasks) claim an 'improved tradeoff between class adherence and sample quality' and statistically significant outperformance of EvoProtGrad. The manuscript must supply the precise classifier architecture, guidance scale schedule, quantitative metrics (with error bars and sample sizes), and the exact definition of 'sample quality' used in the comparison; without these, the practical utility claim cannot be evaluated.

    Authors: We agree that these experimental details must be fully specified. The revised Methods section now states: (i) the classifier is a fine-tuned ESM-2 model with a linear regression head trained on hydrophobicity and predicted affinity labels; (ii) classifier-free guidance is applied with a linear schedule from scale 0.5 to 3.0 across denoising timesteps; (iii) sample quality is defined as the average log-perplexity under a held-out antibody-specific pLM; (iv) all metrics are reported as means with standard error over three independent runs of 500 samples each, with statistical significance assessed by two-sided t-tests. Updated figures and tables include error bars and the requested quantitative values. revision: yes

Circularity Check

0 steps flagged

No circularity in claimed results or modeling chain

full rationale

The paper's core contributions are an empirical modification to discrete diffusion (germline as absorbing state) plus classifier-guided sampling. Reported gains (26% to 46% non-germline accuracy, outperformance vs EvoProtGrad) are presented as experimental outcomes on held-out sequences and external baselines, with no equations, fitted parameters, or self-citations that reduce the quantitative claims to the training inputs by construction. The modeling choices are described as biologically motivated inductive biases rather than derived from prior self-referential results, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that germline sequences represent the appropriate biological starting point for modeling somatic hypermutation trajectories, with no free parameters or new entities explicitly introduced beyond the diffusion framework itself.

axioms (1)
  • domain assumption Germline sequences serve as the natural absorbing state for antibody sequence evolution trajectories.
    Invoked in the description of the germline-absorbing diffusion noise process to justify excluding V(D)J recombination statistics.

pith-pipeline@v0.9.0 · 5579 in / 1313 out tokens · 43605 ms · 2026-05-11T00:55:26.716668+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    RosettaAntibodyDesign (RAbD): A general framework for computational antibody design.PLOS Computational Biology, 14(4):e1006112, 2018

    Jared Adolf-Bryfogle, Oleks Kalyuzhniy, Michael Kubitz, Brian D Weitzner, Xiaozhen Hu, Yumiko Adachi, William R Schief, and Roland L Dunbrack. RosettaAntibodyDesign (RAbD): A general framework for computational antibody design.PLOS Computational Biology, 14(4):e1006112, 2018

  2. [2]

    Protein generation with evolutionary diffusion: sequence is all you need

    Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex X Lu, Nicolo Fusi, Ava P Amini, and Kevin K Yang. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, pages 2023–09, 2023

  3. [3]

    Struc- tured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Struc- tured denoising diffusion models in discrete state-spaces. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, pages 17981–17993, 2021

  4. [4]

    Biophysical cartography of the native and human-engineered antibody landscapes quantifies the plasticity of antibody developability.Communications Biology, 7(1):922, 2024

    Habib Bashour, Eva Smorodina, Matteo Pariset, Jahn Zhong, Rahmad Akbar, Maria Chernigov- skaya, Khang Lê Quý, Igor Snapkow, Puneet Rawat, Konrad Krawczyk, Geir Kjetil Sandve, Jose Gutierrez-Marcos, Daniel Nakhaee-Zadeh Gutierrez, Jan Terje Andersen, and Victor Greiff. Biophysical cartography of the native and human-engineered antibody landscapes quantifie...

  5. [5]

    A continuous time framework for discrete denoising models

    Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. InAdvances in Neural Information Processing Systems, 2022

  6. [6]

    Conformations of immunoglobulin hypervariable regions.Nature, 342(6252):877–883, 1989

    Cyrus Chothia, Arthur M Lesk, Anna Tramontano, Michael Levitt, Sandra J Smith-Gill, Gillian Air, Steven Sheriff, Eduardo A Padlan, David Davies, William R Tulip, et al. Conformations of immunoglobulin hypervariable regions.Nature, 342(6252):877–883, 1989

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805, 2018

  8. [8]

    ProtTrans: Toward understanding the language of life through self-supervised learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2021

    Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, et al. ProtTrans: Toward understanding the language of life through self-supervised learning.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2021

  9. [9]

    Patrick Emami, Aidan Perreault, Jeffrey Law, David Biagioni, and Peter St. John. Plug & play directed evolution of proteins with gradient-based discrete MCMC.Machine Learning: Science and Technology, 4(2):025014, 2023

  10. [10]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022. 10

  11. [11]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 6840–6851, 2020

  12. [12]

    Argmax flows and multinomial diffusion: Learning categorical distributions

    Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. InAdvances in Neural Information Processing Systems (NeurIPS), volume 34, pages 12454–12465, 2021

  13. [13]

    Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

    John B Ingraham, Max Baranov, Zak Costello, Vincent Frappier, Ahmed Ismail, Shan Tie, Wujie Wang, Vincent Xu, Anna Zakharova, Eric Gao, et al. Illuminating protein space with a programmable generative model.Nature, 623(7989):1070–1078, 2023

  14. [14]

    Nett, Beth Sharkey, Beata Bobrowicz, Isabelle Caffry, Yao Yu, Yuan Cao, Heather Lynaugh, Michael Brown, Hemanta Baruah, Laura T

    Tushar Jain, Tingwan Sun, Stéphanie Durand, Amy Hall, Nga Rewa Houston, Juergen H. Nett, Beth Sharkey, Beata Bobrowicz, Isabelle Caffry, Yao Yu, Yuan Cao, Heather Lynaugh, Michael Brown, Hemanta Baruah, Laura T. Gray, Eric M. Krauland, Yingda Xu, Maximiliano Vásquez, and K. Dane Wittrup. Biophysical properties of the clinical-stage antibody landscape. Pro...

  15. [15]

    Highly accurate protein structure prediction with AlphaFold.Nature, 596(7873):583–589, 2021

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ron- neberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold.Nature, 596(7873):583–589, 2021

  16. [16]

    DiffWave: A versatile diffusion model for audio synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. InInternational Conference on Learning Representations (ICLR), 2021

  17. [17]

    Observed antibody space: A resource for data mining next-generation sequencing of antibody repertoires.Journal of Immunology, 201(8):2502–2509, 2018

    Aleksandr Kovaltsuk, Jinwoo Leem, Sebastian Kelm, James Snowden, Charlotte M Deane, and Konrad Krawczyk. Observed antibody space: A resource for data mining next-generation sequencing of antibody repertoires.Journal of Immunology, 201(8):2502–2509, 2018

  18. [18]

    Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

    Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model.Science, 379(6637):1123–1130, 2023

  19. [19]

    Discrete diffusion modeling by estimating the ratios of the data distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. InProceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2024

  20. [20]

    Development of therapeutic antibodies for the treatment of diseases.Journal of Biomedical Science, 27(1):1, 2020

    Ruei-Min Lu, Yu-Chyi Hwang, I-Ju Liu, Chi-Chiu Lee, Hsin-Zung Tsai, Hung-Jen Li, and Han-Chung Wu. Development of therapeutic antibodies for the treatment of diseases.Journal of Biomedical Science, 27(1):1, 2020

  21. [21]

    Antigen- specific antibody design and optimization with diffusion-based generative models for protein structures

    Shitong Luo, Yufeng Su, Xingang Peng, Sheng Wang, Jian Peng, and Jianzhu Ma. Antigen- specific antibody design and optimization with diffusion-based generative models for protein structures. InAdvances in Neural Information Processing Systems (NeurIPS), volume 35, pages 9754–9767, 2022

  22. [22]

    Large language models generate functional protein sequences across diverse families.Nature Biotechnology, 41(8):1099–1106, 2023

    Ali Madani, Ben Krause, Eric R Greene, Subu Subramanian, Benjamin P Mohr, James M Holton, Jose Luis Olmos, Caiming Xiong, Zachary Z Sun, Richard Socher, et al. Large language models generate functional protein sequences across diverse families.Nature Biotechnology, 41(8):1099–1106, 2023

  23. [23]

    Makowski, Patrick C

    Emily K. Makowski, Patrick C. Kinnunen, Jie Huang, Lina Wu, Matthew D. Smith, Tiexin Wang, Alec A. Desai, Craig N. Streu, Yulei Zhang, Jennifer M. Zupancic, John S. Schardt, Jennifer J. Linderman, and Peter M. Tessier. Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space.N...

  24. [24]

    AbDiffuser: Full-atom generation of in-vitro functioning antibodies

    Karolis Martinkus, Jan Ludwiczak, Wei-Ching Cho, Luke Lairson, Nathan Frey, Andreas Müller, Richard Bonneau, Andrew Watkins, Dip Bhatt, and Debora S Marks. AbDiffuser: Full-atom generation of in-vitro functioning antibodies. InAdvances in Neural Information Processing Systems (NeurIPS), volume 36, 2023. 11

  25. [25]

    Separating selection from mutation in antibody language models.eLife, 14:RP109644, 2025

    Frederick A Matsen IV , Will Dumm, Kevin Sung, Mackenzie M Johnson, David Rich, Tyler Starr, Yun S Song, Julia Fukuyama, and Hugh K Haddox. Separating selection from mutation in antibody language models.eLife, 14:RP109644, 2025. Reviewed preprint

  26. [26]

    FDA approves 100th monoclonal antibody product.Nature Reviews Drug Discovery, 20(7):491–495, 2021

    Asher Mullard. FDA approves 100th monoclonal antibody product.Nature Reviews Drug Discovery, 20(7):491–495, 2021

  27. [27]

    ProGen2: Exploring the boundaries of protein language models.Cell Systems, 14(11):968–978, 2023

    Erik Nijkamp, Jeffrey A Ruffolo, Eli N Weinstein, Nikhil Naik, and Ali Madani. ProGen2: Exploring the boundaries of protein language models.Cell Systems, 14(11):968–978, 2023

  28. [28]

    A new clustering of antibody CDR loop conformations.Journal of Molecular Biology, 406(2):228–256, 2011

    Benjamin North, Andreas Lehmann, and Roland L Dunbrack. A new clustering of antibody CDR loop conformations.Journal of Molecular Biology, 406(2):228–256, 2011

  29. [29]

    AbLang: An antibody language model for completing incomplete antibody sequences.Bioinformatics Advances, 2(1):vbac046, 2022

    Tobias H Olsen, Fergus Boyles, and Charlotte M Deane. AbLang: An antibody language model for completing incomplete antibody sequences.Bioinformatics Advances, 2(1):vbac046, 2022

  30. [30]

    Observed antibody space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences.Protein Science, 31(1):141–146, 2022

    Tobias H Olsen, Fergus Boyles, and Charlotte M Deane. Observed antibody space: A diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences.Protein Science, 31(1):141–146, 2022

  31. [31]

    Addressing the antibody germline bias and its effect on language models for improved antibody design.Bioinformatics, 40(11):btae618, 2024

    Tobias H Olsen, Iain H Moal, and Charlotte M Deane. Addressing the antibody germline bias and its effect on language models for improved antibody design.Bioinformatics, 40(11):btae618, 2024

  32. [32]

    Your absorbing discrete diffusion secretly models the conditional distributions of clean data

    Jingyang Ou, Shen Nie, Kaiwen Xue, Fengqi Zhu, Jiacheng Sun, Zhenguo Li, and Chongxuan Li. Your absorbing discrete diffusion secretly models the conditional distributions of clean data. InInternational Conference on Learning Representations (ICLR), 2025

  33. [33]

    Boltz-2: Towards accurate and efficient binding affinity prediction.bioRxiv, 2025

    Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, and Regina Barzilay. Boltz-2: Towards accurate and efficient binding affinity prediction.bioRxiv, 2025

  34. [34]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

  35. [35]

    Matthew I. J. Raybould, Claire Marks, Konrad Krawczyk, Bruck Taddese, Jaroslaw Nowak, Alan P. Lewis, Alexander Bujotzek, Jiye Shi, and Charlotte M. Deane. Five computational developability guidelines for therapeutic antibody profiling.Proceedings of the National Academy of Sciences, 116(10):4025–4030, 2019

  36. [36]

    Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021

  37. [37]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022

  38. [38]

    A., Gray, J

    Jeffrey A Ruffolo, Jeffrey J Gray, and Jeremias Sulam. Deciphering antibody affinity maturation with language models and weakly supervised learning.arXiv preprint arXiv:2112.07782, 2021

  39. [39]

    de Almeida, Alexander Rush, Thomas Pierrot, and V olodymyr Kuleshov

    Yair Schiff, Subham Sekhar Sahoo, Hao Phung, Guanghan Wang, Sam Boshar, Hugo Dalla- Torre, Bernardo P. de Almeida, Alexander Rush, Thomas Pierrot, and V olodymyr Kuleshov. Simple guidance mechanisms for discrete diffusion models. InThe Thirteenth International Conference on Learning Representations, 2025

  40. [40]

    Generative language modeling for antibody design.Cell Systems, 14(11):979–989, 2023

    Richard W Shuai, Jeffrey A Ruffolo, and Jeffrey J Gray. Generative language modeling for antibody design.Cell Systems, 14(11):979–989, 2023

  41. [41]

    Toward high-resolution homology modeling of antibody Fv regions and application to antibody–antigen docking.Proteins: Structure, Function, and Bioinformatics, 74(2):497–514, 2009

    Arvind Sivasubramanian, Aroop Sircar, Sidhartha Chaudhury, and Jeffrey J Gray. Toward high-resolution homology modeling of antibody Fv regions and application to antibody–antigen docking.Proteins: Structure, Function, and Bioinformatics, 74(2):497–514, 2009. 12

  42. [42]

    Deep unsuper- vised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsuper- vised learning using nonequilibrium thermodynamics. InProceedings of the 32nd International Conference on Machine Learning (ICML), pages 2256–2265. PMLR, 2015

  43. [43]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021

  44. [44]

    MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.Nature Biotechnology, 35(11):1026–1028, 2017

    Martin Steinegger and Johannes Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.Nature Biotechnology, 35(11):1026–1028, 2017

  45. [46]

    Clustering huge protein sequence sets in linear time

    Martin Steinegger and Johannes Söding. Clustering huge protein sequence sets in linear time. Nature Communications, 9(1):2542, 2018

  46. [47]

    DPLM: Diffusion protein language model

    Xinyou Wang, Zaixiang Zheng, Fei Ye, Dongyu Xue, Shujian Huang, and Quanquan Gu. DPLM: Diffusion protein language model. InAdvances in Neural Information Processing Systems (NeurIPS), volume 37, 2024

  47. [48]

    De novo design of protein structure and function with RFdiffusion.Nature, 620(7976):1089–1100, 2023

    Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with RFdiffusion.Nature, 620(7976):1089–1100, 2023

  48. [49]

    SE(3) diffusion model with application to protein backbone generation

    Jason Yim, Brian L Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. SE(3) diffusion model with application to protein backbone generation. InProceedings of the 40th International Conference on Machine Learning (ICML), pages 40001–40039. PMLR, 2023. 13 S1 Supplementary Figures (A) (B) (C) Figure S1:Performance...