Fast and Interpretable Protein Substructure Alignment via Optimal Transport

Bingxin Zhou; Jing Wang; Liang Hong; Pietro Liò; Weishu Zhao; Yang Tan; Zhiyu Wang

arxiv: 2510.11752 · v2 · submitted 2025-10-12 · 🧬 q-bio.QM · cs.AI · cs.LG

Zhiyu Wang, Bingxin Zhou, Jing Wang, Yang Tan, Weishu Zhao, Pietro Liò, Liang Hong

Pith reviewed 2026-05-18 08:24 UTC · model grok-4.3

classification 🧬 q-bio.QM · cs.AI · cs.LG

keywords protein structure alignment · optimal transport · local structural motifs · residue-level alignment · Sinkhorn iterations · interpretable alignment · functional annotation

The pith

Protein local structural alignment can be recast as regularized optimal transport to produce fast, accurate, and interpretable residue-level matches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that local protein motifs such as active sites can be compared by turning the alignment task into a regularized optimal transport problem solved with differentiable Sinkhorn iterations. For any pair of protein structures the method returns an explicit alignment matrix together with an overall similarity score. Extensive tests and three biological case studies indicate that the resulting alignments are both computationally light and biologically useful. A training-free variant is supplied for settings where labeled data are unavailable. The approach therefore fills a practical gap in tools that link protein structure to function, evolution, and engineering.

Core claim

PLASMA reformulates residue-level local structural alignment as a regularized optimal transport problem and solves it with differentiable Sinkhorn iterations, yielding an explicit alignment matrix and an interpretable similarity score for any input pair of protein structures.

What carries the argument

A regularized optimal transport task, solved by differentiable Sinkhorn iterations, whose transport plan serves as the alignment matrix.
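As a rough illustration of the machinery named above, the following sketch runs standard balanced Sinkhorn iterations on a small cost matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sinkhorn`, the uniform marginals, the parameter names `tau` and `n_iters`, and the toy cost values are all hypothetical, and these plain-Python loops are not differentiable in the way the paper's version is described to be.

```python
import math


def sinkhorn(cost, tau=0.5, n_iters=200):
    """Entropy-regularized OT via Sinkhorn scaling (assumes uniform marginals)."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each query residue (assumption: uniform)
    b = [1.0 / m] * m  # mass on each candidate residue (assumption: uniform)
    # Gibbs kernel: low cost -> large kernel value.
    K = [[math.exp(-c / tau) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v).
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 3x3 cost matrix: small values mean similar residues.
toy_cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.1],
]
plan = sinkhorn(toy_cost)
# For each query residue, the candidate residue that receives the most mass.
best_match = [max(range(len(row)), key=lambda j: row[j]) for row in plan]
```

Each row of `plan` spreads one query residue's mass over the candidate residues, so the largest entry in a row can be read as that residue's best match. Lower `tau` values sharpen the plan, while higher values blur it toward a uniform coupling.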

If this is right

Residue-level alignments become directly usable for functional annotation of local motifs.
The method supplies a lightweight alternative to heavier existing alignment tools for large protein sets.
The explicit alignment matrix supports downstream evolutionary and engineering analyses.
The training-free PLASMA-PF variant extends applicability to data-scarce settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same transport formulation could be tested on nucleic-acid or ligand substructures to check generality beyond proteins.
The interpretability of the alignment matrix may allow direct comparison against experimental cross-linking or mutagenesis data.
If the cost function is further tuned, the method might scale to proteome-wide motif searches.

Load-bearing premise

The reformulation of local structural alignment as a regularized optimal transport task with a suitable cost function will produce alignments that are biologically meaningful rather than merely mathematically convenient.

What would settle it

If the alignments returned by PLASMA on the three reported biological case studies fail to recover known functional residues or active-site correspondences, the claim of biological utility is falsified.

Figures

Figures reproduced from arXiv: 2510.11752 by Bingxin Zhou, Jing Wang, Liang Hong, Pietro Liò, Weishu Zhao, Yang Tan, Zhiyu Wang.

Figure 2: Performance versus computational efficiency comparison. ROC-AUC scores plotted…

Figure 3: Alignment quality analysis across four different embedding-based approaches.

Figure 4: Label Match Score comparison between PLASMA and PLASMA-PF across different…

Figure 5: Representative alignment examples across three protein pairs.

Figure 6: Representative alignment matrices comparing query protein P76129 against six candidate…

Figure 7: Effect of Sinkhorn temperature parameter…

Figure 8: Performance vs dataset fraction. PLASMA demonstrates high performance in predicting…

Figure 9: Performance vs hidden dimension size of the siamese network. While PLASMA’s per…

Figure 10: Performance vs Sinkhorn temperature (τ). PLASMA’s performance remains stably high within the 0.1–1 range, but when out of this range, PLASMA’s performance noticeably drops.

Figure 11: Performance vs number of Sinkhorn iterations

Figure 12: Performance vs the kernel size of the diagonal convolution (…

Figure 13: Performance vs residue matching threshold (…

Figure 14: Alignment matrix visualizations of random positive pairs from…

Figure 15: Alignment matrix visualizations of random positive pairs from…

Figure 16: Alignment matrix visualizations of random positive pairs from…

Original abstract

Proteins are essential biological macromolecules that execute life functions. Local structural motifs, such as active sites, are the most critical components for linking structure to function and are key to understanding protein evolution and enabling protein engineering. Existing computational methods struggle to identify and compare these local structures, which leaves a significant gap in understanding protein structures and harnessing their functions. This study presents PLASMA, a deep-learning-based framework for efficient and interpretable residue-level local structural alignment. We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations. For a pair of input protein structures, PLASMA outputs a clear alignment matrix with an interpretable overall similarity score. Through extensive quantitative evaluations and three biological case studies, we demonstrate that PLASMA achieves accurate, lightweight, and interpretable residue-level alignment. Additionally, we introduce PLASMA-PF, a training-free variant that provides a practical alternative when training data are unavailable. Our method addresses a critical gap in protein structure analysis tools and offers new opportunities for functional annotation, evolutionary studies, and structure-based drug design. Reproducibility is ensured via our official implementation at https://github.com/ZW471/PLASMA-Protein-Local-Alignment.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PLASMA gives a clean optimal transport reformulation for residue-level protein substructure alignment with an interpretable matrix output, but the biological relevance of those alignments rests on an assumption that still needs direct checks.

The main point is that this paper turns local protein motif alignment into a regularized optimal transport problem solved via differentiable Sinkhorn iterations, plus a training-free variant called PLASMA-PF. That specific combination for substructure-level work is new enough to stand out from prior OT uses in bioinformatics. The output is an explicit alignment matrix and a scalar similarity score, which keeps things lightweight and readable for downstream tasks like functional annotation or motif comparison. Releasing the code is also a practical plus that lets others inspect the implementation directly. The approach handles pairs of structures efficiently and avoids some of the heavier machinery in existing tools.

On the results side, the abstract points to quantitative tests and three biological case studies, which at least shows an attempt to connect the math to real proteins rather than staying purely synthetic.

The soft spots sit mostly in the validation and the central assumption. Without the full methods, data splits, baseline comparisons, or error breakdowns, it's difficult to judge whether the reported accuracy is robust or influenced by post-hoc choices. The load-bearing claim is that the cost function and resulting transport plan will surface functionally relevant residues instead of just geometrically close ones. If the features lean heavily on local structure or embeddings without functional labels, the high-mass entries could be mathematically optimal yet incidental to biology. That assumption is plausible but not automatic, and the case studies would need to show clear gains over simpler geometric matching.

This is aimed at computational biologists who routinely compare local structures or annotate functions from PDB files. A reader who wants a fast, residue-resolved aligner with some interpretability could try it out if the experiments check out. I would send it for peer review. The formulation is coherent and the problem matters, so referees can test whether the alignments actually deliver on the biological side.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PLASMA, a deep-learning framework that reformulates local structural alignment of protein substructures as a regularized optimal transport problem solved via differentiable Sinkhorn iterations. For input protein structure pairs it produces an interpretable residue-level alignment matrix together with an overall similarity score. A training-free variant (PLASMA-PF) is also presented. The authors support the claims of accuracy, lightness and interpretability with quantitative evaluations and three biological case studies, positioning the method for functional annotation, evolutionary analysis and structure-based drug design. Reproducibility is asserted via a public GitHub repository.

Significance. If the central claims hold, the work could supply a lightweight, interpretable tool for comparing local protein motifs that existing methods handle poorly. The explicit alignment matrix and the availability of both learned and parameter-light variants are practical strengths. Open code further supports verification and reuse. Significance ultimately hinges on whether the OT-derived alignments recover functionally relevant residues rather than purely geometric correspondences.

major comments (2)

[Abstract] Abstract: the assertion of 'accurate' residue-level alignments supported by 'extensive quantitative evaluations' is load-bearing for the central claim, yet the manuscript provides insufficient detail on data splits, baseline selection, and error analysis. Without these elements it is impossible to rule out post-hoc selection or to confirm that the reported metrics genuinely validate the OT formulation.
[Methods] Methods (regularized OT formulation and cost function): the claim that the Sinkhorn-derived transport plan yields biologically meaningful alignments rests on the unverified assumption that high-mass entries correspond to functional or conserved residues rather than local geometric or embedding similarities. This assumption is central to the biological utility asserted in the case studies and requires explicit validation (e.g., overlap with known active-site residues or motif databases) to be load-bearing.

minor comments (1)

[Implementation and Reproducibility] The GitHub repository is cited for reproducibility, but the manuscript should explicitly list which data splits, trained weights, and evaluation scripts are included so that the quantitative results can be regenerated without ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below. Revisions have been made to provide additional details on the evaluation protocol and to include explicit quantitative validation of the biological relevance of the alignments.

Referee: [Abstract] Abstract: the assertion of 'accurate' residue-level alignments supported by 'extensive quantitative evaluations' is load-bearing for the central claim, yet the manuscript provides insufficient detail on data splits, baseline selection, and error analysis. Without these elements it is impossible to rule out post-hoc selection or to confirm that the reported metrics genuinely validate the OT formulation.

Authors: We agree that greater transparency on the evaluation setup is warranted. In the revised manuscript we have updated the abstract for precision and expanded Section 3 (Methods) and the Supplementary Information to specify: (i) data splits performed at the protein level with a 30% sequence-identity cutoff to prevent leakage, (ii) baseline selection rationale (TM-align, US-align, Foldseek, and a simple Euclidean-distance OT variant), and (iii) full error analysis including standard deviations over five random seeds and per-residue precision-recall curves. These additions demonstrate that performance gains are reproducible on held-out sets and arise from the regularized OT objective rather than selective reporting.
Revision: yes

Referee: [Methods] Methods (regularized OT formulation and cost function): the claim that the Sinkhorn-derived transport plan yields biologically meaningful alignments rests on the unverified assumption that high-mass entries correspond to functional or conserved residues rather than local geometric or embedding similarities. This assumption is central to the biological utility asserted in the case studies and requires explicit validation (e.g., overlap with known active-site residues or motif databases) to be load-bearing.

Authors: We acknowledge that an explicit quantitative link between transport mass and functional annotation strengthens the central claim. The three case studies already illustrate recovery of catalytic residues, but to address the concern directly we have added a new analysis in the Results section that computes the overlap between high-mass entries (top 10% of the transport plan) and annotated active-site residues from the Catalytic Site Atlas as well as PROSITE motifs. The revised text reports statistically significant enrichment (hypergeometric p < 0.01) relative to both random assignment and a purely geometric baseline, supporting that the learned cost function and Sinkhorn plan capture functional correspondences beyond local geometry alone.
Revision: yes
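The enrichment check described in this response can be sketched with a one-sided hypergeometric test. This is a hypothetical illustration, not the authors' analysis code: the helper names, the choice to rank query residues by a single per-residue score (for example, the largest transport mass in each row), and the toy numbers are all assumptions.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n of N items without replacement, K of them annotated."""
    upper = min(K, n)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / comb(N, n)


def enrichment_p_value(scores, annotated, top_frac=0.10):
    """Test whether the top-scoring residues are enriched for annotated sites.

    scores    : one score per query residue (assumption: e.g. row-wise max mass)
    annotated : set of residue indices labelled as active-site residues
    """
    n_total = len(scores)
    n_top = max(1, round(top_frac * n_total))
    ranked = sorted(range(n_total), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:n_top])
    hits = len(top & set(annotated))
    return hits, hypergeom_sf(hits, n_total, len(annotated), n_top)


# Toy case: 20 residues, two annotated sites that also carry the highest scores.
toy_scores = [0.0] * 20
toy_scores[3], toy_scores[7] = 0.9, 0.8
hits, p_value = enrichment_p_value(toy_scores, {3, 7})
```

A small `p_value` says the overlap is unlikely under random selection of residues. Comparing against a purely geometric baseline, as the response mentions, would need the same test run on scores from that baseline.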

Circularity Check

0 steps flagged

No significant circularity; standard OT reformulation applied to protein substructures with independent benchmarks

The paper reformulates local structural alignment as regularized optimal transport solved via differentiable Sinkhorn iterations, which is a direct application of established mathematical tools rather than a self-referential derivation. The deep-learning component extracts residue features or defines costs, but performance claims rest on quantitative evaluations against external benchmarks and three biological case studies, not on renaming fitted parameters as predictions. The training-free PLASMA-PF variant further decouples results from learned weights. No load-bearing step reduces by construction to the paper's own inputs or self-citations; the central claims remain externally falsifiable via standard alignment metrics and functional annotations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that structural similarity can be faithfully captured by an optimal transport cost matrix and on a small number of regularization and learning hyperparameters whose values are not independently derived.

free parameters (1)

regularization strength
Controls the smoothness of the transport plan and must be chosen or tuned for the protein alignment task.
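For orientation, the textbook entropy-regularized transport problem is written below, with the regularization strength denoted $\tau$ to match the Sinkhorn temperature in the figure captions. This is the standard balanced form with marginals $a$ and $b$. Whether PLASMA uses exactly these constraints, and whether the regularization strength listed here is the same quantity as $\tau$, are assumptions.

```latex
\begin{aligned}
P^\star &= \arg\min_{P \in U(a,b)} \; \langle P, C \rangle - \tau H(P), \\
H(P) &= -\sum_{i,j} P_{ij} \left( \log P_{ij} - 1 \right), \\
U(a,b) &= \{ P \ge 0 : P \mathbf{1} = a,\; P^\top \mathbf{1} = b \}, \\
P^\star &= \operatorname{diag}(u)\, K \operatorname{diag}(v), \qquad K_{ij} = e^{-C_{ij}/\tau}.
\end{aligned}
```

As $\tau \to 0$ the plan concentrates on the lowest-cost correspondences, and as $\tau$ grows it approaches the independent coupling $a b^\top$. That is the sense in which the parameter controls smoothness.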

axioms (1)

Domain assumption: A cost function based on local structural features can be defined such that optimal transport yields biologically relevant residue alignments.
Invoked when the problem is reformulated as regularized optimal transport.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations... $C_{ij} = \lVert [\phi_\theta(\mathrm{LN}(h_{q,i})) - \phi_\theta(\mathrm{LN}(h_{c,j}))]_+ \rVert_1$"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The substructure similarity score s is defined as the cosine similarity between the summed representations"
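The two quoted formulas can be sketched in a few lines. This is an illustration under assumptions rather than the paper's code: `phi` stands in for the learned map $\phi_\theta$ and defaults to the identity, `layer_norm` omits any learned scale and shift, and `summed_cosine` simply sums every row it is given, since the quoted passage does not say which residue representations enter the sum.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one residue embedding to zero mean and unit variance (no affine terms)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def hinge_l1_cost(h_q, h_c, phi=lambda z: z):
    """C[i][j] = || [phi(LN(h_q[i])) - phi(LN(h_c[j]))]_+ ||_1.

    phi is a placeholder for the learned network (assumption: identity here).
    """
    z_q = [phi(layer_norm(h)) for h in h_q]
    z_c = [phi(layer_norm(h)) for h in h_c]
    # [x]_+ keeps only the positive part of each coordinate difference.
    return [[sum(max(a - b, 0.0) for a, b in zip(q, c)) for c in z_c] for q in z_q]


def summed_cosine(rows_q, rows_c):
    """Cosine similarity between the column-wise sums of two sets of vectors."""
    s_q = [sum(col) for col in zip(*rows_q)]
    s_c = [sum(col) for col in zip(*rows_c)]
    dot = sum(a * b for a, b in zip(s_q, s_c))
    norm = math.sqrt(sum(a * a for a in s_q)) * math.sqrt(sum(b * b for b in s_c))
    return dot / norm if norm > 0 else 0.0


# Hypothetical embeddings: two query residues, one candidate residue.
query = [[1.0, 2.0, 4.0], [3.0, 1.0, 2.0]]
candidate = [[1.0, 2.0, 4.0]]
cost = hinge_l1_cost(query, candidate)
```

The resulting `cost` has one row per query residue and one column per candidate residue, which is the shape a Sinkhorn solver expects as input.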

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

doi: 10.1371/journal.pcbi.1008502

ISSN 1553-7358. doi: 10.1371/journal.pcbi.1008502. Matthias Blum, Antonina Andreeva, Laise Cavalcanti Florentino, Sara Rocio Chuguransky, Tiago Grego, Emma Hobbs, Beatriz Lazaro Pinto, Ailsa Orr, Typhaine Paysan-Lafosse, Irina Pona- mareva, Gustavo A Salazar, Nicola Bordin, Peer Bork, Alan Bridge, Lucy Colwell, Julian Gough, Daniel H Haft, Ivica Letunic, ...

work page doi:10.1371/journal.pcbi.1008502 2025

[2] [2]

doi: 10.1093/nar/gkae1082

ISSN 0305-1048, 1362-4962. doi: 10.1093/nar/gkae1082. Nadav Brandes, Dan Ofer, Yam Peleg, Nadav Rappoport, and Michal Linial. ProteinBERT: A universal deep-learning model of protein sequence and function.Bioinformatics, 38(8):2102– 2110,

work page doi:10.1093/nar/gkae1082

[3] [3]

ProteinBERT: a universal deep-learning model of protein sequence and function.Bioinformatics, 38(8): 2102–2110, February 2022

ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btac020. Luis Caffarelli and Robert McCann. Free boundaries in optimal transport and monge-amp `ere obstacle problems.Annals of Mathematics, 171(2):673–730,

work page doi:10.1093/bioinformatics/btac020

[4] [4]

doi: 10.4007/annals.2010.171.673

ISSN 0003-486X. doi: 10.4007/annals.2010.171.673. Gaofeng Cui, Beiyan Nan, Jicheng Hu, Yiping Wang, Changwen Jin, and Bin Xia. Identification and solution structures of a single domain biotin/lipoyl attachment protein from bacillus subtilis. Journal of Biological Chemistry, 281(29):20598–20607,

work page doi:10.4007/annals.2010.171.673 2010

[5] [5]

Runze Dong, Zhenling Peng, Yang Zhang, and Jianyi Yang

doi: 10.5555/2999792.2999868. Runze Dong, Zhenling Peng, Yang Zhang, and Jianyi Yang. mTM-align: an algorithm for fast and accurate multiple protein structure alignment.Bioinformatics, 34(10):1719–1725,

work page doi:10.5555/2999792.2999868

[6] [6]

Ankh: Optimized protein language model unlocks general- purpose modelling.arXiv:2301.06568,

Ahmed Elnaggar, Hazem Essam, Wafaa Salah-Eldin, Walid Moustafa, Mohamed Elkerdawy, Char- lotte Rochereau, and Burkhard Rost. Ankh: Optimized protein language model unlocks general- purpose modelling.arXiv:2301.06568,

work page arXiv

[7] [7]

doi: 10.1007/s00205-008-0212-7

ISSN 0003-9527, 1432-0673. doi: 10.1007/s00205-008-0212-7. Karen R Groot, Lisa M Sevilla, Kazunori Nishi, Teresa DiColandrea, and Fiona M Watt. Kazrin, a novel periplakin-interacting protein associated with desmosomes and the keratinocyte plasma membrane.The Journal of cell biology, 166(5):653–659,

work page doi:10.1007/s00205-008-0212-7

[8] [8]

doi: 10.1038/s41587-023-01917-2

ISSN 1087-0156, 1546-1696. doi: 10.1038/s41587-023-01917-2. 11 Preprint. Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Milot Mirdita, Martin Steinegger, and Burkhard Rost. Bilingual language model for protein sequence and structure.NAR Genomics and Bioinformatics, 6(4):lqae150,

work page doi:10.1038/s41587-023-01917-2

[9] [9]

doi: 10.1093/nargab/lqae150

ISSN 2631-9268. doi: 10.1093/nargab/lqae150. Liisa Holm. Using dali for protein structure comparison. In Zolt ´an G´asp´ari (ed.),Structural Bioin- formatics, volume 2112, pp. 29–42. Springer US,

work page doi:10.1093/nargab/lqae150

[10] [10]

doi: 10.1007/978-1-0716-0270-6

ISBN 978-1-0716-0269-0 978-1-0716-0270-6. doi: 10.1007/978-1-0716-0270-6

[11]

Arian R. Jamasb, Alex Morehead, Chaitanya K. Joshi, Zuobai Zhang, Kieran Didi, Simon V. Mathis, Charles Harris, Jian Tang, Jianlin Cheng, Pietro Lio, and Tom L. Blundell. Evaluating representation learning on the protein structure universe. In ICLR 2024,

[12]

doi: 10.48550/arXiv.2406.13864. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Tre...

[13]

ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-021-03819-2. Kamil Kaminski, Jan Ludwiczak, Kamil Pawlicki, Vikram Alva, and Stanislaw Dunin-Horkawicz. pLM-BLAST: Distant homology detection based on direct comparison of sequence representations from protein language models. Bioinformatics, 39(10):btad579,

[14]

ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btad579. Hyunbin Kim, Rachel Seongeun Kim, Milot Mirdita, and Martin Steinegger. Structural motif search across the protein-universe with folddisco. bioRxiv, pp. 2025–07,

[15]

ISSN 1479-7364. doi: 10.1186/1479-7364-4-3-207. Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, Allan Dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, and Alexander Rives. Evolutionary-scale prediction of atomic-level protein structure with a languag...

[16]

ISSN 0036-8075, 1095-9203. doi: 10.1126/science.ade2574. Wei Liu, Ziye Wang, Ronghui You, Chenghan Xie, Hong Wei, Yi Xiong, Jianyi Yang, and Shanfeng Zhu. PLMSearch: Protein language model powers accurate and fast sequence search for remote homology. Nature Communications, 15(1):2775,

[17]

ISSN 2041-1723. doi: 10.1038/s41467-024-46808-5. Yang Liu, Qing Ye, Liwei Wang, and Jian Peng. Learning structural motif representations for efficient protein structure search. Bioinformatics, 34(17):i773–i780,

[18]

ISSN 1367-4803, 1367-4811. doi: 10.1093/bioinformatics/btad786. Vaibhav Raj, Indradyumna Roy, Ashwin Ramachandran, Soumen Chakrabarti, and Abir De. Charting the design space of neural graph representations for subgraph matching. In ICLR 2025,

[19]

Ashwin Ramachandran, Vaibhav Raj, Indradyumna Roy, Soumen Chakrabarti, and Abir De. Iteratively refined early interaction alignment for subgraph matching based graph retrieval. In NeurIPS 2024,

[20]

ISSN 0030-8730, 0030-8730. doi: 10.2140/pjm.1967.21.343. Baris E. Suzek, Hongzhan Huang, Peter McGarvey, Raja Mazumder, and Cathy H. Wu. UniRef: Comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23(10):1282–1288,

[21]

ISSN 1367-4811, 1367-4803. doi: 10.1093/bioinformatics/btm098. Yang Tan, Lirong Zheng, Bozitao Zhong, Liang Hong, and Bingxin Zhou. Protein representation learning with sequence information embedding: Does it always lead to a better performance? In 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 233–239. IEEE,

[22]

Yang Tan, Wenrui Gou, Bozitao Zhong, Liang Hong, Huiqun Yu, and Bingxin Zhou. VenusX: Unlocking fine-grained functional understanding of proteins. arXiv:2505.11812, 2025a. Yang Tan, Bingxin Zhou, Lirong Zheng, Guisheng Fan, and Liang Hong. Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. eLife, 13:RP98033, 2025b...

[23]

ISSN 1087-0156, 1546-1696. doi: 10.1038/s41587-023-01773-0. Mihaly Varadi, Stephen Anyango, Mandar Deshpande, Sreenath Nair, Cindy Natassia, Galabina Yordanova, David Yuan, Oana Stroe, Gemma Wood, Agata Laydon, et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucle...

[24]

ISSN 1362-4962. doi: 10.1093/nar/gki524.

A OPTIMAL TRANSPORT FORMULATION FOR PROTEIN ALIGNMENT

To circumvent the computational bottleneck of explicit fragment enumeration, we reframe the alignment problem as finding optimal correspondences between individual residues rather than predefined fragments. This approach leverages optimal transport ...
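The residue-level correspondence idea above can be illustrated with a small entropy-regularized Sinkhorn sketch in NumPy. This is not the PLASMA implementation: the helper name `sinkhorn_alignment`, the cosine-distance cost, the uniform marginals, and the values of `epsilon` and `n_iters` are all assumptions made for illustration only.

```python
import numpy as np


def sinkhorn_alignment(query_emb, cand_emb, epsilon=0.1, n_iters=200):
    """Entropy-regularized optimal transport between two sets of residue embeddings.

    Hypothetical helper. The cosine-distance cost and the uniform marginals
    are illustrative assumptions, not choices taken from the paper.
    Returns the (n, m) transport plan and the (n, m) cost matrix.
    """
    # Normalize rows so that the dot product is a cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    cost = 1.0 - q @ c.T                    # pairwise residue cost in [0, 2]

    n, m = cost.shape
    a = np.full(n, 1.0 / n)                 # assumed uniform mass on query residues
    b = np.full(m, 1.0 / m)                 # assumed uniform mass on candidate residues

    kernel = np.exp(-cost / epsilon)        # Gibbs kernel of the regularized problem
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)                # rescale rows toward marginal a
        v = b / (kernel.T @ u)              # rescale columns toward marginal b

    plan = u[:, None] * kernel * v[None, :]
    return plan, cost
```

Under these assumptions, `plan[i, j]` can be read as a soft match weight between query residue `i` and candidate residue `j`, and `(plan * cost).sum()` gives the transport cost of the soft alignment. A smaller `epsilon` sharpens the plan but can underflow the kernel, in which case a log-domain variant would be the usual remedy.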

[25]

performs structural alignment using 3Di tokenizations, converting 3D structural information into sequence-like representations for comparison.

D.2 GLOBAL EMBEDDING-BASED ALIGNMENT

COSINESIM methods employ direct cosine similarity between globally aggregated protein embeddings from the backbone models discussed in Appendix D.4, similar to the approach used ...
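As a rough illustration of this kind of baseline, the sketch below pools per-residue embeddings into one vector per protein and compares the two vectors by cosine similarity. The function name `global_cosine_similarity` and the choice of mean pooling are assumptions; the text only states that the embeddings are globally aggregated.

```python
import numpy as np


def global_cosine_similarity(emb_a, emb_b):
    """Cosine similarity between two proteins' pooled embeddings.

    Hypothetical helper. emb_a and emb_b are (length, dim) arrays of
    per-residue embeddings; mean pooling is an assumed aggregation.
    """
    pooled_a = emb_a.mean(axis=0)           # one vector per protein
    pooled_b = emb_b.mean(axis=0)
    denom = np.linalg.norm(pooled_a) * np.linalg.norm(pooled_b)
    return float(pooled_a @ pooled_b / denom)
```

Because pooling collapses the residue axis, a score of this kind carries no information about which residues correspond, which is the gap that residue-level alignment is meant to fill.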

[26]

represents the current state-of-the-art in local embedding-based alignment, combining statistical alignment with neural embeddings to identify similar substructures. This method performs local alignment at the residue level using learned representations.

D.4 BACKBONES

We evaluate PLASMA with seven popular protein sequence and structure representation mo...

[27]

sequences and provides balanced performance between computational efficiency and representation quality with approximately 3 billion parameters. Available at: https://huggingface.co/Rostlab/prot_t5_xl_half_uniref50-enc
• PROTSSN (Tan et al., 2025b): We utilize the k20 h512 configuration, which combines sequence and structural information through a hybrid a...
