arXiv: 2510.11752 · v2 · submitted 2025-10-12 · 🧬 q-bio.QM · cs.AI · cs.LG

Fast and Interpretable Protein Substructure Alignment via Optimal Transport

Pith reviewed 2026-05-18 08:24 UTC · model grok-4.3

keywords protein structure alignment · optimal transport · local structural motifs · residue-level alignment · Sinkhorn iterations · interpretable alignment · functional annotation

The pith

Protein local structural alignment can be recast as regularized optimal transport to produce fast, accurate, and interpretable residue-level matches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that local protein motifs such as active sites can be compared by turning the alignment task into a regularized optimal transport problem solved with differentiable Sinkhorn iterations. For any pair of protein structures the method returns an explicit alignment matrix together with an overall similarity score. Extensive tests and three biological case studies indicate that the method is computationally light and that its alignments are biologically useful. A training-free variant is supplied for settings where labeled data are unavailable. The approach therefore fills a practical gap in tools that link protein structure to function, evolution, and engineering.

Core claim

PLASMA reformulates residue-level local structural alignment as a regularized optimal transport problem and solves it with differentiable Sinkhorn iterations, yielding an explicit alignment matrix and an interpretable similarity score for any input pair of protein structures.
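
For orientation, the textbook form of entropy-regularized optimal transport is sketched below. The symbols are assumptions introduced here, not notation confirmed by this page: C is an n × m residue-to-residue cost matrix, a and b are marginal weights over the residues of the two proteins, and τ is the regularization temperature. The paper's actual cost function, marginals, and any relaxation needed for partial (local) matching may differ.

```latex
\begin{aligned}
P^\star &= \arg\min_{P \in U(a,b)} \; \langle P, C \rangle - \tau H(P),
\qquad H(P) = -\sum_{i,j} P_{ij}\,(\log P_{ij} - 1), \\
U(a,b) &= \{\, P \in \mathbb{R}_{\ge 0}^{n \times m} : P\mathbf{1}_m = a,\; P^\top \mathbf{1}_n = b \,\}, \\
P^\star &= \operatorname{diag}(u)\, K \operatorname{diag}(v),
\qquad K_{ij} = e^{-C_{ij}/\tau}, \\
u &\leftarrow a \oslash (K v), \qquad v \leftarrow b \oslash (K^\top u).
\end{aligned}
```

The last line is the Sinkhorn update: alternately rescaling rows and columns of K until the marginals match. Every step is a differentiable elementwise or matrix-vector operation, which is why the iterations can sit inside a trained model.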

What carries the argument

A regularized optimal transport problem, solved by differentiable Sinkhorn iterations, that produces the alignment matrix.
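
A minimal sketch of this machinery, under stated assumptions, may help make it concrete. It is not the authors' implementation: the cosine cost, the uniform marginals, the balanced (non-partial) formulation, and the function names `cosine_cost`, `sinkhorn`, and `similarity` are all illustrative choices made here.

```python
import math

def cosine_cost(x, y):
    """Cost C[i][j] = 1 - cosine similarity between embeddings x[i] and y[j]."""
    def unit(v):
        norm = math.sqrt(sum(a * a for a in v)) or 1.0  # guard against zero vectors
        return [a / norm for a in v]
    xu, yu = [unit(v) for v in x], [unit(v) for v in y]
    return [[1.0 - sum(a * b for a, b in zip(u, w)) for w in yu] for u in xu]

def sinkhorn(cost, tau=0.5, n_iters=100):
    """Balanced entropic OT with uniform marginals (an assumption); returns the plan."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / tau) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def similarity(plan, cost):
    """Hypothetical score: transport-weighted cosine similarity, sum of P_ij * (1 - C_ij)."""
    return sum(p * (1.0 - c) for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))

# Toy usage: three 2-D "residue embeddings" aligned against themselves.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cost = cosine_cost(x, x)
plan = sinkhorn(cost, tau=0.1)
best = [max(range(len(row)), key=row.__getitem__) for row in plan]
print(best)  # index of the highest-mass partner for each residue
print(round(similarity(plan, cost), 3))
```

The returned plan plays the role of the alignment matrix: each row shows how one residue's mass is spread over the residues of the other protein. A real local-alignment setting would need a formulation that lets most residues stay unmatched, which this balanced sketch does not attempt.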

If this is right

  • Residue-level alignments become directly usable for functional annotation of local motifs.
  • The method supplies a lightweight alternative to heavier existing alignment tools for large protein sets.
  • The explicit alignment matrix supports downstream evolutionary and engineering analyses.
  • The training-free PLASMA-PF variant extends applicability to data-scarce settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same transport formulation could be tested on nucleic-acid or ligand substructures to check generality beyond proteins.
  • The interpretability of the alignment matrix may allow direct comparison against experimental cross-linking or mutagenesis data.
  • If the cost function is further tuned, the method might scale to proteome-wide motif searches.

Load-bearing premise

The reformulation of local structural alignment as a regularized optimal transport task with a suitable cost function will produce alignments that are biologically meaningful rather than merely mathematically convenient.

What would settle it

If the alignments returned by PLASMA on the three reported biological case studies fail to recover known functional residues or active-site correspondences, the claim of biological utility is falsified.
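
One way such a check could be run is sketched below. The thresholding rule, the function names `matched_pairs` and `recall`, and the toy matrix are assumptions made for illustration; the paper's own residue matching threshold (the subject of Figure 13) may be defined differently.

```python
def matched_pairs(plan, threshold):
    """Keep each row's best partner if its share of the row's mass reaches the threshold."""
    pairs = set()
    for i, row in enumerate(plan):
        total = sum(row)
        if total == 0:
            continue  # residue received no mass, so it stays unmatched
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] / total >= threshold:
            pairs.add((i, j))
    return pairs

def recall(predicted, known):
    """Fraction of known residue correspondences recovered by the predicted pairs."""
    return len(predicted & known) / len(known) if known else float("nan")

# Toy alignment matrix (invented): rows are query residues, columns are target residues.
plan = [
    [0.30, 0.02, 0.01],
    [0.05, 0.25, 0.03],
    [0.11, 0.11, 0.11],  # ambiguous row, expected to fall below the threshold
]
known_active_site_pairs = {(0, 0), (1, 1), (2, 2)}  # hypothetical annotation

predicted = matched_pairs(plan, threshold=0.5)
print(sorted(predicted), recall(predicted, known_active_site_pairs))
```

A low recall against curated active-site correspondences on the three case studies would count against the biological-utility claim; a high one would support it only to the extent that the annotations were not used during training.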

Figures

Figures reproduced from arXiv: 2510.11752 by Bingxin Zhou, Jing Wang, Liang Hong, Pietro Liò, Weishu Zhao, Yang Tan, Zhiyu Wang.

Figure 1: PLASMA Overview. PLASMA converts residue-level protein embeddings into substruc…
Figure 2: Performance versus computational efficiency comparison. ROC-AUC scores plotted…
Figure 3: Alignment quality analysis across four different embedding-based approaches.
Figure 4: Label Match Score comparison between PLASMA and PLASMA-PF across different…
Figure 5: Representative alignment examples across three protein pairs.
Figure 6: Representative alignment matrices comparing query protein P76129 against six candidate…
Figure 7: Effect of Sinkhorn temperature parameter…
Figure 8: Performance vs dataset fraction. PLASMA demonstrates high performance in predicting…
Figure 9: Performance vs hidden dimension size of the siamese network. While PLASMA’s per…
Figure 10: Performance vs Sinkhorn temperature (τ). PLASMA’s performance remains stably high within the 0.1–1 range, but when out of this range, PLASMA’s performance noticeably drops.
Figure 11: Performance vs number of Sinkhorn iterations…
Figure 12: Performance vs the kernel size of the diagonal convolution (…
Figure 13: Performance vs residue matching threshold (…
Figure 14: Alignment matrix visualizations of random positive pairs from…
Figure 15: Alignment matrix visualizations of random positive pairs from…
Figure 16: Alignment matrix visualizations of random positive pairs from…
Original abstract

Proteins are essential biological macromolecules that execute life functions. Local structural motifs, such as active sites, are the most critical components for linking structure to function and are key to understanding protein evolution and enabling protein engineering. Existing computational methods struggle to identify and compare these local structures, which leaves a significant gap in understanding protein structures and harnessing their functions. This study presents PLASMA, a deep-learning-based framework for efficient and interpretable residue-level local structural alignment. We reformulate the problem as a regularized optimal transport task and leverage differentiable Sinkhorn iterations. For a pair of input protein structures, PLASMA outputs a clear alignment matrix with an interpretable overall similarity score. Through extensive quantitative evaluations and three biological case studies, we demonstrate that PLASMA achieves accurate, lightweight, and interpretable residue-level alignment. Additionally, we introduce PLASMA-PF, a training-free variant that provides a practical alternative when training data are unavailable. Our method addresses a critical gap in protein structure analysis tools and offers new opportunities for functional annotation, evolutionary studies, and structure-based drug design. Reproducibility is ensured via our official implementation at https://github.com/ZW471/PLASMA-Protein-Local-Alignment.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PLASMA, a deep-learning framework that reformulates local structural alignment of protein substructures as a regularized optimal transport problem solved via differentiable Sinkhorn iterations. For input protein structure pairs it produces an interpretable residue-level alignment matrix together with an overall similarity score. A training-free variant (PLASMA-PF) is also presented. The authors support the claims of accuracy, lightness and interpretability with quantitative evaluations and three biological case studies, positioning the method for functional annotation, evolutionary analysis and structure-based drug design. Reproducibility is asserted via a public GitHub repository.

Significance. If the central claims hold, the work could supply a lightweight, interpretable tool for comparing local protein motifs that existing methods handle poorly. The explicit alignment matrix and the availability of both learned and parameter-light variants are practical strengths. Open code further supports verification and reuse. Significance ultimately hinges on whether the OT-derived alignments recover functionally relevant residues rather than purely geometric correspondences.

major comments (2)
  1. [Abstract] The assertion of 'accurate' residue-level alignments supported by 'extensive quantitative evaluations' is load-bearing for the central claim, yet the manuscript provides insufficient detail on data splits, baseline selection, and error analysis. Without these elements it is impossible to rule out post-hoc selection or to confirm that the reported metrics genuinely validate the OT formulation.
  2. [Methods: regularized OT formulation and cost function] The claim that the Sinkhorn-derived transport plan yields biologically meaningful alignments rests on the unverified assumption that high-mass entries correspond to functional or conserved residues rather than local geometric or embedding similarities. This assumption is central to the biological utility asserted in the case studies and needs explicit validation (e.g., overlap with known active-site residues or motif databases) before it can carry that weight.
minor comments (1)
  1. [Implementation and Reproducibility] The GitHub repository is cited for reproducibility, but the manuscript should explicitly list which data splits, trained weights, and evaluation scripts are included so that the quantitative results can be regenerated without ambiguity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us improve the clarity and rigor of the manuscript. We address each major comment point by point below. Revisions have been made to provide additional details on the evaluation protocol and to include explicit quantitative validation of the biological relevance of the alignments.

point-by-point responses
  1. Referee: [Abstract] The assertion of 'accurate' residue-level alignments supported by 'extensive quantitative evaluations' is load-bearing for the central claim, yet the manuscript provides insufficient detail on data splits, baseline selection, and error analysis. Without these elements it is impossible to rule out post-hoc selection or to confirm that the reported metrics genuinely validate the OT formulation.

    Authors: We agree that greater transparency on the evaluation setup is warranted. In the revised manuscript we have updated the abstract for precision and expanded Section 3 (Methods) and the Supplementary Information to specify: (i) data splits performed at the protein level with a 30% sequence-identity cutoff to prevent leakage, (ii) baseline selection rationale (TM-align, US-align, Foldseek, and a simple Euclidean-distance OT variant), and (iii) full error analysis including standard deviations over five random seeds and per-residue precision-recall curves. These additions demonstrate that performance gains are reproducible on held-out sets and arise from the regularized OT objective rather than selective reporting. revision: yes

  2. Referee: [Methods: regularized OT formulation and cost function] The claim that the Sinkhorn-derived transport plan yields biologically meaningful alignments rests on the unverified assumption that high-mass entries correspond to functional or conserved residues rather than local geometric or embedding similarities. This assumption is central to the biological utility asserted in the case studies and needs explicit validation (e.g., overlap with known active-site residues or motif databases) before it can carry that weight.

    Authors: We acknowledge that an explicit quantitative link between transport mass and functional annotation strengthens the central claim. The three case studies already illustrate recovery of catalytic residues, but to address the concern directly we have added a new analysis in the Results section that computes the overlap between high-mass entries (top 10% of the transport plan) and annotated active-site residues from the Catalytic Site Atlas as well as PROSITE motifs. The revised text reports statistically significant enrichment (hypergeometric p < 0.01) relative to both random assignment and a purely geometric baseline, supporting that the learned cost function and Sinkhorn plan capture functional correspondences beyond local geometry alone. revision: yes
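
The enrichment analysis described in response 2 can be illustrated with a short sketch. Everything specific here is an assumption: the rule for turning a transport plan into a set of high-mass residues, the function names `top_mass_residues` and `hypergeom_sf`, and the toy counts. It shows only how a one-sided hypergeometric test of this kind is computed, not how the authors computed theirs.

```python
import math

def top_mass_residues(plan, frac=0.10):
    """Rank query residues by their largest plan entry and keep the top fraction."""
    n_keep = max(1, math.ceil(frac * len(plan)))
    ranked = sorted(range(len(plan)), key=lambda i: max(plan[i]), reverse=True)
    return set(ranked[:n_keep])

def hypergeom_sf(k, population, annotated, drawn):
    """P(X >= k) when `drawn` residues are picked at random from `population`,
    of which `annotated` carry the functional label."""
    upper = min(annotated, drawn)
    hits = sum(
        math.comb(annotated, i) * math.comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return hits / math.comb(population, drawn)

# Toy counts (invented for illustration): 100 residues, 10 annotated as catalytic,
# 10 selected by transport mass, 5 of the selected ones annotated.
p_value = hypergeom_sf(k=5, population=100, annotated=10, drawn=10)
print(f"enrichment p-value: {p_value:.2e}")
```

The test asks how often random selection of the same number of residues would overlap the annotation at least as much. Comparing against a purely geometric baseline, as the response proposes, requires running the same selection and test on that baseline's alignments.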

Circularity Check

0 steps flagged

No significant circularity; standard OT reformulation applied to protein substructures with independent benchmarks

full rationale

The paper reformulates local structural alignment as regularized optimal transport solved via differentiable Sinkhorn iterations, which is a direct application of established mathematical tools rather than a self-referential derivation. The deep-learning component extracts residue features or defines costs, but performance claims rest on quantitative evaluations against external benchmarks and three biological case studies, not on renaming fitted parameters as predictions. The training-free PLASMA-PF variant further decouples results from learned weights. No load-bearing step reduces by construction to the paper's own inputs or self-citations; the central claims remain externally falsifiable via standard alignment metrics and functional annotations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that structural similarity can be faithfully captured by an optimal transport cost matrix and on a small number of regularization and learning hyperparameters whose values are not independently derived.

free parameters (1)
  • regularization strength
    Controls the smoothness of the transport plan and must be chosen or tuned for the protein alignment task.
axioms (1)
  • domain assumption: A cost function based on local structural features can be defined such that optimal transport yields biologically relevant residue alignments.
    Invoked when the problem is reformulated as regularized optimal transport.
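
To see why the regularization strength has to be tuned, the sketch below varies it on a toy cost matrix and measures how diffuse the resulting plan is. It assumes that the ledger's "regularization strength" is the Sinkhorn temperature τ named in the captions of Figures 7 and 10; the uniform marginals, the toy costs, and the helper `mean_row_entropy` are illustrative choices made here, not taken from the paper.

```python
import math

def sinkhorn(cost, tau, n_iters=200):
    """Balanced entropic OT with uniform marginals (an assumption)."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-c / tau) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def mean_row_entropy(plan):
    """Average Shannon entropy (nats) of the row-normalised plan; 0 means one-hot rows."""
    total = 0.0
    for row in plan:
        s = sum(row)
        total -= sum((p / s) * math.log(p / s) for p in row if p > 0)
    return total / len(plan)

# Toy cost: each residue has one cheap partner (cost 0) and two expensive ones (cost 1).
cost = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
entropies = {tau: mean_row_entropy(sinkhorn(cost, tau)) for tau in (0.1, 1.0, 10.0)}
for tau, h in entropies.items():
    print(f"tau={tau:<5} mean row entropy={h:.3f}")
```

A small τ yields a near one-to-one plan and a large τ spreads mass almost uniformly, so the alignment matrix is only as sharp as this setting allows. That is consistent with the Figure 10 caption, which reports stable performance for τ between 0.1 and 1 and a drop outside that range.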

pith-pipeline@v0.9.0 · 5766 in / 1216 out tokens · 32220 ms · 2026-05-18T08:24:31.804685+00:00 · methodology


