Fast and Interpretable Protein Substructure Alignment via Optimal Transport
Pith reviewed 2026-05-18 08:24 UTC · model grok-4.3
The pith
Protein local structural alignment can be recast as regularized optimal transport to produce fast, accurate, and interpretable residue-level matches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PLASMA reformulates residue-level local structural alignment as a regularized optimal transport problem and solves it with differentiable Sinkhorn iterations, yielding an explicit alignment matrix and an interpretable similarity score for any input pair of protein structures.
What carries the argument
A regularized optimal transport task, solved by differentiable Sinkhorn iterations, that produces the alignment matrix.
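To make this machinery concrete, here is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations. It is an illustration under stated assumptions, not PLASMA's implementation: the function name `sinkhorn`, the uniform marginals, the value of `eps`, and the toy cost matrix are all hypothetical, and the paper's learned cost and marginal constraints may differ. The sketch is plain Python, so it is not differentiable; a real implementation would run the same updates inside an autodiff framework.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT between two residue sets.

    Assumes uniform marginals (an illustrative choice, not from the paper).
    Returns a transport plan with the same shape as `cost`.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each query residue
    b = [1.0 / m] * m  # mass on each candidate residue
    # Gibbs kernel: low cost -> high affinity.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Toy cost: residue i in the query is cheapest to match with residue i in the candidate.
cost = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
plan = sinkhorn(cost)
best_match = [max(range(len(row)), key=row.__getitem__) for row in plan]
```

Reading off the largest entry in each row of `plan` gives a hard residue-to-residue assignment, while the full matrix keeps the soft correspondences that the review describes as the interpretable output.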
If this is right
- Residue-level alignments become directly usable for functional annotation of local motifs.
- The method supplies a lightweight alternative to heavier existing alignment tools for large protein sets.
- The explicit alignment matrix supports downstream evolutionary and engineering analyses.
- The training-free PLASMA-PF variant extends applicability to data-scarce settings.
Where Pith is reading between the lines
- The same transport formulation could be tested on nucleic-acid or ligand substructures to check generality beyond proteins.
- The interpretability of the alignment matrix may allow direct comparison against experimental cross-linking or mutagenesis data.
- If the cost function is further tuned, the method might scale to proteome-wide motif searches.
Load-bearing premise
The reformulation of local structural alignment as a regularized optimal transport task with a suitable cost function will produce alignments that are biologically meaningful rather than merely mathematically convenient.
What would settle it
If the alignments returned by PLASMA on the three reported biological case studies fail to recover known functional residues or active-site correspondences, the claim of biological utility is falsified.
Original abstract
Proteins are essential biological macromolecules that execute life functions. Local structural motifs, such as active sites, are the most critical components for linking structure to function and are key to understanding protein evolution and enabling protein engineering. Existing computational methods struggle to identify and compare these local structures, which leaves a significant gap in understanding protein structures and harnessing their functions. This study presents PLASMA, a deep-learning-based framework for efficient and interpretable residue-level local structural alignment. We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations. For a pair of input protein structures, PLASMA outputs a clear alignment matrix with an interpretable overall similarity score. Through extensive quantitative evaluations and three biological case studies, we demonstrate that PLASMA achieves accurate, lightweight, and interpretable residue-level alignment. Additionally, we introduce PLASMA-PF, a training-free variant that provides a practical alternative when training data are unavailable. Our method addresses a critical gap in protein structure analysis tools and offers new opportunities for functional annotation, evolutionary studies, and structure-based drug design. Reproducibility is ensured via our official implementation at https://github.com/ZW471/PLASMA-Protein-Local-Alignment.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PLASMA, a deep-learning framework that reformulates local structural alignment of protein substructures as a regularized optimal transport problem solved via differentiable Sinkhorn iterations. For input protein structure pairs it produces an interpretable residue-level alignment matrix together with an overall similarity score. A training-free variant (PLASMA-PF) is also presented. The authors support the claims of accuracy, lightness and interpretability with quantitative evaluations and three biological case studies, positioning the method for functional annotation, evolutionary analysis and structure-based drug design. Reproducibility is asserted via a public GitHub repository.
Significance. If the central claims hold, the work could supply a lightweight, interpretable tool for comparing local protein motifs that existing methods handle poorly. The explicit alignment matrix and the availability of both learned and parameter-light variants are practical strengths. Open code further supports verification and reuse. Significance ultimately hinges on whether the OT-derived alignments recover functionally relevant residues rather than purely geometric correspondences.
major comments (2)
- [Abstract] The assertion of 'accurate' residue-level alignments supported by 'extensive quantitative evaluations' is load-bearing for the central claim, yet the manuscript provides insufficient detail on data splits, baseline selection, and error analysis. Without these elements it is impossible to rule out post-hoc selection or to confirm that the reported metrics genuinely validate the OT formulation.
- [Methods] (Regularized OT formulation and cost function.) The claim that the Sinkhorn-derived transport plan yields biologically meaningful alignments rests on the unverified assumption that high-mass entries correspond to functional or conserved residues rather than local geometric or embedding similarities. This assumption is central to the biological utility asserted in the case studies and requires explicit validation (e.g., overlap with known active-site residues or motif databases) before it can support that claim.
minor comments (1)
- [Implementation and Reproducibility] The GitHub repository is cited for reproducibility, but the manuscript should explicitly list which data splits, trained weights, and evaluation scripts are included so that the quantitative results can be regenerated without ambiguity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below. Revisions have been made to provide additional details on the evaluation protocol and to include explicit quantitative validation of the biological relevance of the alignments.
- Referee: [Abstract] The assertion of 'accurate' residue-level alignments supported by 'extensive quantitative evaluations' is load-bearing for the central claim, yet the manuscript provides insufficient detail on data splits, baseline selection, and error analysis. Without these elements it is impossible to rule out post-hoc selection or to confirm that the reported metrics genuinely validate the OT formulation.
  Authors: We agree that greater transparency on the evaluation setup is warranted. In the revised manuscript we have updated the abstract for precision and expanded Section 3 (Methods) and the Supplementary Information to specify: (i) data splits performed at the protein level with a 30% sequence-identity cutoff to prevent leakage, (ii) baseline selection rationale (TM-align, US-align, Foldseek, and a simple Euclidean-distance OT variant), and (iii) full error analysis including standard deviations over five random seeds and per-residue precision-recall curves. These additions demonstrate that performance gains are reproducible on held-out sets and arise from the regularized OT objective rather than selective reporting.
  Revision: yes
- Referee: [Methods] (Regularized OT formulation and cost function.) The claim that the Sinkhorn-derived transport plan yields biologically meaningful alignments rests on the unverified assumption that high-mass entries correspond to functional or conserved residues rather than local geometric or embedding similarities. This assumption is central to the biological utility asserted in the case studies and requires explicit validation (e.g., overlap with known active-site residues or motif databases) before it can support that claim.
  Authors: We acknowledge that an explicit quantitative link between transport mass and functional annotation strengthens the central claim. The three case studies already illustrate recovery of catalytic residues, but to address the concern directly we have added a new analysis in the Results section that computes the overlap between high-mass entries (top 10% of the transport plan) and annotated active-site residues from the Catalytic Site Atlas as well as PROSITE motifs. The revised text reports statistically significant enrichment (hypergeometric p < 0.01) relative to both random assignment and a purely geometric baseline, supporting that the learned cost function and Sinkhorn plan capture functional correspondences beyond local geometry alone.
  Revision: yes
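The enrichment analysis described in this response can be sketched as a one-sided hypergeometric test. The sketch below is hypothetical throughout: the rebuttal does not say how plan entries are mapped to residues, so it assumes each residue already has a single score (for example, its largest transport-plan entry), and the helper names `top_fraction` and `hypergeom_sf`, the toy scores, and the annotated set are invented for illustration.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n residues from N, of which K are annotated."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, x) * comb(N - K, n - x) for x in range(max(k, 0), upper + 1))
    return hits / total

def top_fraction(scores, frac=0.10):
    """Indices of the highest-scoring `frac` of residues (at least one)."""
    n_keep = max(1, round(frac * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_keep])

# Toy data (assumed): 50 residues, five of them with clearly higher mass.
scores = [0.01] * 50
for i in (3, 17, 28, 40, 45):
    scores[i] = 0.2
annotated = {3, 9, 17, 28}  # hypothetical active-site annotations

selected = top_fraction(scores, frac=0.10)
overlap = len(selected & annotated)
p_value = hypergeom_sf(overlap, N=len(scores), K=len(annotated), n=len(selected))
```

A small `p_value` says the selected residues overlap the annotations more than random selection of the same size would. It does not by itself separate functional signal from geometric similarity, which is why the rebuttal also compares against a purely geometric baseline.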
Circularity Check
No significant circularity; standard OT reformulation applied to protein substructures with independent benchmarks
The paper reformulates local structural alignment as regularized optimal transport solved via differentiable Sinkhorn iterations, which is a direct application of established mathematical tools rather than a self-referential derivation. The deep-learning component extracts residue features or defines costs, but performance claims rest on quantitative evaluations against external benchmarks and three biological case studies, not on renaming fitted parameters as predictions. The training-free PLASMA-PF variant further decouples results from learned weights. No load-bearing step reduces by construction to the paper's own inputs or self-citations; the central claims remain externally falsifiable via standard alignment metrics and functional annotations.
Axiom & Free-Parameter Ledger
free parameters (1)
- regularization strength
axioms (1)
- domain assumption: A cost function based on local structural features can be defined such that optimal transport yields biologically relevant residue alignments.
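For orientation, the single free parameter in this ledger is the regularization strength, written $\varepsilon$ below. The following is the generic textbook form of entropy-regularized optimal transport, given here as an assumption about the general setting rather than as PLASMA's exact objective; the marginals $a$, $b$ and the constraint set $U(a,b)$ are illustrative symbols that do not appear in the source.

```latex
\begin{aligned}
P^\star &= \arg\min_{P \in U(a,b)} \; \langle P, C \rangle - \varepsilon H(P),
\qquad H(P) = -\sum_{i,j} P_{ij}\,(\log P_{ij} - 1), \\
U(a,b) &= \{\, P \ge 0 \;:\; P\mathbf{1} = a,\; P^{\top}\mathbf{1} = b \,\}, \\
P^\star &= \operatorname{diag}(u)\, K \operatorname{diag}(v),
\qquad K_{ij} = \exp(-C_{ij}/\varepsilon).
\end{aligned}
```

In this form, a small $\varepsilon$ pushes the plan toward a sparse, near one-to-one matching, while a large $\varepsilon$ spreads mass more evenly. The domain assumption listed above concerns the cost matrix $C$, which this generic form leaves unspecified.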
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations... $C_{ij} = \left\| \left[ \phi_\theta(\mathrm{LN}(h_{q,i})) - \phi_\theta(\mathrm{LN}(h_{c,j})) \right]_+ \right\|_1$"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, J_uniquely_calibrated_via_higher_derivative (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The substructure similarity score s is defined as the cosine similarity between the summed representations"
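The two quoted passages can be illustrated with a short sketch: a hinge-style L1 cost between layer-normalized residue embeddings, and a cosine similarity between summed vectors. Several pieces are assumptions made only for illustration. `phi` stands in for the learned map $\phi_\theta$ and is the identity here, `layer_norm` omits any learned scale and shift, and the quoted passage does not say which residue representations are summed, so `cosine_of_sums` simply sums whatever rows it is given.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one embedding to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((t - mean) ** 2 for t in x) / len(x)
    return [(t - mean) / math.sqrt(var + eps) for t in x]

def phi(x):
    """Placeholder for the learned map phi_theta; identity is an assumption."""
    return list(x)

def hinge_l1_cost(h_q, h_c):
    """C[i][j] = || [phi(LN(h_q[i])) - phi(LN(h_c[j]))]_+ ||_1."""
    z_q = [phi(layer_norm(h)) for h in h_q]
    z_c = [phi(layer_norm(h)) for h in h_c]
    return [[sum(max(a - b, 0.0) for a, b in zip(q, c)) for c in z_c] for q in z_q]

def cosine_of_sums(rows_a, rows_b):
    """Cosine similarity between the element-wise sums of two sets of vectors."""
    s_a = [sum(col) for col in zip(*rows_a)]
    s_b = [sum(col) for col in zip(*rows_b)]
    dot = sum(x * y for x, y in zip(s_a, s_b))
    norm_a = math.sqrt(sum(x * x for x in s_a))
    norm_b = math.sqrt(sum(y * y for y in s_b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings for two query and two candidate residues.
h_q = [[1.0, 2.0, 3.0], [3.0, 1.0, 2.0]]
h_c = [[1.0, 2.0, 3.0], [2.0, 3.0, 1.0]]
C = hinge_l1_cost(h_q, h_c)
s = cosine_of_sums(h_q, h_c)
```

The positive-part operator keeps only the coordinates where the query embedding exceeds the candidate embedding, so identical residues get zero cost and every entry is non-negative.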
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.