pith. machine review for the scientific record. sign in

arxiv: 2605.08924 · v1 · submitted 2026-05-09 · 💻 cs.CE

Recognition: no theorem link

PPI2Text: Captioning Protein-Protein Interactions with Coordinate-Aligned Pair-Map Decoding

Authors on Pith no claims yet

Pith reviewed 2026-05-12 01:56 UTC · model grok-4.3

classification 💻 cs.CE
keywords protein-protein interactionsfree-text generationmultimodal language modelsamino acid sequencespair-map decodingpositional encodingfactuality evaluationbiological databases
0
0 comments X

The pith

A multimodal model generates free-text descriptions of protein-protein interactions from amino acid sequences alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts protein-protein interaction modeling from binary classification to free-form text generation so that outputs can capture nuanced details and link more easily to existing literature. It builds PPI2Text by feeding two sequences through an ESM3 encoder, forming a complete residue-pair interaction grid, and decoding that grid into natural language with a Qwen3 decoder equipped with coordinate-aligned positional encoding. This setup is trained on a newly released 351k-pair corpus of synthesized descriptions drawn from ten biological databases. The resulting model records higher scores than baselines on standard language metrics and on an LLM-based factuality check that compares outputs directly to raw evidence.

Core claim

PPI2Text encodes each protein with an ESM3 encoder, builds a pair map across all residue pairs to represent interactions, and autoregressively produces free-text descriptions with a Qwen3 decoder; a coordinate-aligned positional encoding (PaCo-RoPE) ensures each axis of the pair grid matches the residue positions of the corresponding protein. Trained on the 351k-pair PPI2Text-Dataset, the model surpasses strong baselines on linguistic metrics against synthesized references and on factuality metrics where an LLM judge scores outputs against raw biological evidence.

What carries the argument

The coordinate-aligned pair map, which represents every possible residue pair between two proteins in a grid and uses PaCo-RoPE to align positional information along each protein's residue axis before decoding into text.

If this is right

  • Free-text PPI descriptions support richer biological detail and direct integration with literature knowledge bases compared with controlled-vocabulary labels.
  • The model records higher linguistic metric scores against synthesized references and higher factuality scores when an LLM judge compares outputs to raw biological evidence.
  • Ablation results indicate that both the full pair-map construction and the coordinate-aligned positional encoding contribute measurably to performance.
  • The released 351k-pair dataset enables further training and benchmarking of text-based PPI models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Automated generation of interaction summaries could assist literature curation or hypothesis generation in systems biology.
  • The same pair-map approach might extend to modeling interactions within larger protein complexes once suitable training data are available.
  • Independent wet-lab validation on interactions absent from the training synthesis would test whether the Gemini-generated references introduce systematic biases.

Load-bearing premise

The 351k-pair dataset of descriptions synthesized by Gemini from curated databases supplies reliable and unbiased ground truth for both model training and factuality evaluation.

What would settle it

Evaluating the model's generated descriptions against a set of experimentally verified interactions drawn from primary literature that was never used in the Gemini synthesis step, and measuring systematic mismatches in reported binding details or functional effects.

Figures

Figures reproduced from arXiv: 2605.08924 by Achilleas Tsortos, Costas Bouyioukos, Lawrence P. Petalidis, Michalis Vazirgiannis, Sarah Almeida Carneiro, Xiao Fei, Yang Zhang.

Figure 1
Figure 1. Figure 1: An example of the dataset generation (interaction between HEXIM1 and CDK9). The top [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Model architecture of PPI2Text. (a) In dual-stream joint encoder, both proteins are first separately encoded and compressed in parallel then jointly encoded to construct a PairMap, modeling interactions between pairs of residues. (b) For multimodal language decoding, both single-protein representations and pair map tokens from the encoder are projected then concatenated with text token embeddings to compos… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the PaCo-RoPE as an extension to standard rotary positional embedding to [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: LLM-as-a-judge scores on baselines and ablations: evaluating the factual correctness of [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example of one-to-many causal effects plus cross-domain bridging. A single enzymatic [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Example of multi-regime contrastive reasoning. The same pair of proteins is described [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example of state-gated activation cascade. The binding event triggers an ordered chain [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Example of allosteric coupling across named conformational states. Binding at one site (the [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Example of context-dependent bistable switch. The identical AXIN1 [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: K-means clustering of the per-pair evidence score. Top: score distribution colored by [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: UMAP (Uniform manifold approximation and projection) visualization of manifold [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Retention bias as a function of keyword frequency. Most high-frequent keywords in the [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Statistics of the augmented PPI dataset samples with the proposed evidence-tiered [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Structured prompt that instructs the LLM to act as an expert biochemist and compare [PITH_FULL_IMAGE:figures/full_fig_p025_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: System prompt used to synthesize the free-text descriptions. A decision block reference [PITH_FULL_IMAGE:figures/full_fig_p026_15.png] view at source ↗
read the original abstract

Protein-protein interaction (PPI) modeling has been widely studied as a binary or multi-label classification task. While emerging multimodal large language models (LLMs) can now describe single proteins, they remain unable to generate free-form descriptions of interactions between protein pairs. Moving beyond controlled vocabulary annotations, we propose to model PPI using free-text description, enabling richer expressiveness, improved interpretability, and better integration with literature knowledge base. We present PPI2Text, a multimodal LLM for free-form PPI captioning from amino acid sequences, that encodes each protein using ESM3 encoder, constructs a pair map from the two representations to capture interactions across all residue pairs, and autoregressively generates descriptions using a Qwen3 language decoder. We further introduce PaCo-RoPE, a coordinate-aligned positional encoding that aligns each axis of the pair grid with the residue positions of the corresponding protein. In addition, we release PPI2Text-Dataset, a 351k-pair corpus of free-form PPI descriptions aggregated from ten curated biological databases and further synthesized with Gemini under evidence-tiered prompting. PPI2Text consistently outperforms strong baselines across multiple ablation settings and evaluation protocols. It not only achieves higher scores on linguistic metrics against synthesized references, but also excels on factuality metrics, where an LLM-based judge evaluates outputs against raw biological evidence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces PPI2Text, a multimodal LLM for generating free-form textual captions of protein-protein interactions (PPIs) from amino acid sequences. Proteins are encoded separately with ESM3, a pair-map is constructed to model all residue-pair interactions, and a Qwen3 decoder performs autoregressive generation. A coordinate-aligned positional encoding (PaCo-RoPE) is proposed to align the pair-grid axes with each protein's residue indices. The authors also release the PPI2Text-Dataset: 351k free-form PPI descriptions aggregated from ten biological databases and synthesized via Gemini under evidence-tiered prompting. The central claim is that PPI2Text outperforms strong baselines on linguistic metrics (versus the synthesized references) and on factuality metrics (LLM judge versus raw biological evidence) across multiple ablation settings and evaluation protocols.

Significance. If the reported gains can be shown to arise from the architectural contributions rather than artifacts of the shared synthetic data pipeline, the work would meaningfully extend PPI modeling beyond binary or multi-label classification toward richer, literature-aligned textual descriptions. The public release of the 351k-pair corpus and the introduction of PaCo-RoPE constitute concrete, reusable contributions that could support follow-on research in multimodal biological sequence modeling.

major comments (2)
  1. [Abstract and Dataset Construction] Abstract and dataset construction: linguistic metrics are computed against Gemini-synthesized references that were also used to create the training data; this creates a circularity risk in which reported improvements may reflect imitation of the synthesis style or knowledge rather than independent modeling gains. The factuality protocol (LLM judge versus raw evidence) does not eliminate the concern, as no held-out human-validated test set or non-synthesized reference corpus is described.
  2. [Evaluation Protocols] Evaluation protocols: the abstract asserts consistent outperformance 'across multiple ablation settings and evaluation protocols' yet supplies no numerical baseline scores, ablation deltas, error bars, or statistical tests. Without these details the load-bearing claim of superiority cannot be assessed.
minor comments (1)
  1. [Model Architecture] The description of how the pair-map is constructed from the two ESM3 embeddings would benefit from an explicit equation or small diagram showing the dimensionality and interaction aggregation step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the concerns about evaluation circularity and missing quantitative details below, with proposed revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Dataset Construction] Abstract and dataset construction: linguistic metrics are computed against Gemini-synthesized references that were also used to create the training data; this creates a circularity risk in which reported improvements may reflect imitation of the synthesis style or knowledge rather than independent modeling gains. The factuality protocol (LLM judge versus raw evidence) does not eliminate the concern, as no held-out human-validated test set or non-synthesized reference corpus is described.

    Authors: We acknowledge the circularity risk for linguistic metrics, as both training and reference captions originate from the same Gemini synthesis pipeline grounded in database evidence. The test split is held-out, and the factuality protocol evaluates generated text directly against raw database evidence via LLM judge, independent of synthesized references. This mitigates but does not fully eliminate the concern. We will add a small human-validated held-out subset (with inter-annotator agreement) to the revised manuscript and report results against both synthesized and human references. revision: partial

  2. Referee: [Evaluation Protocols] Evaluation protocols: the abstract asserts consistent outperformance 'across multiple ablation settings and evaluation protocols' yet supplies no numerical baseline scores, ablation deltas, error bars, or statistical tests. Without these details the load-bearing claim of superiority cannot be assessed.

    Authors: The full manuscript includes tables with exact baseline scores, ablation deltas, standard deviations across runs, and statistical tests (paired t-tests with p-values). We will revise the abstract to summarize key numerical results (e.g., BLEU/ROUGE gains and factuality improvements with significance) and ensure all figures/tables explicitly report error bars and tests. revision: yes

Circularity Check

0 steps flagged

No significant circularity in model architecture or evaluation chain

full rationale

The paper introduces a multimodal architecture (ESM3 encoder + pair-map + PaCo-RoPE + Qwen3 decoder) trained on a dataset aggregated from ten databases and synthesized via Gemini. Linguistic metrics are reported against held-out synthesized references and factuality metrics use an LLM judge against raw biological evidence. No mathematical derivations, equations, or first-principles results are presented that reduce to inputs by construction. No self-citations are invoked as load-bearing. Standard supervised training and split-based evaluation on explicitly constructed data does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that Gemini-synthesized descriptions accurately reflect biological evidence and that the pair-map plus PaCo-RoPE architecture captures interaction semantics without additional biological priors.

free parameters (1)
  • PaCo-RoPE scaling factors
    Coordinate alignment parameters for the pair grid are introduced but not shown to be derived from first principles.
axioms (1)
  • domain assumption ESM3 and Qwen3 encoders/decoders provide sufficiently rich representations for interaction semantics
    The architecture assumes these off-the-shelf models already encode the necessary cross-protein information.
invented entities (1)
  • PaCo-RoPE positional encoding no independent evidence
    purpose: To align residue positions across the two proteins in the pair map
    New coordinate-aligned RoPE variant introduced for the 2D interaction grid.

pith-pipeline@v0.9.0 · 5566 in / 1370 out tokens · 34647 ms · 2026-05-12T01:56:36.486879+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 5 internal anchors

  1. [1]

    arXiv preprint arXiv:2505.11194 , year=

    Prot2text-v2: Protein function prediction with multimodal contrastive alignment , author=. arXiv preprint arXiv:2505.11194 , year=

  2. [2]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Prott3: Protein-to-text generation for text-based protein understanding , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  3. [3]

    bioRxiv , pages=

    Decoding the molecular language of proteins with evolla , author=. bioRxiv , pages=. 2025 , publisher=

  4. [4]

    Galactica: A Large Language Model for Science

    Galactica: A large language model for science , author=. arXiv preprint arXiv:2211.09085 , year=

  5. [5]

    bioRxiv , pages=

    Protein-protein interaction prediction is achievable with large language models , author=. bioRxiv , pages=. 2023 , publisher=

  6. [6]

    arXiv preprint arXiv:2405.06649 , year=

    ProLLM: protein chain-of-thoughts enhanced LLM for protein-protein interaction prediction , author=. arXiv preprint arXiv:2405.06649 , year=

  7. [7]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Protllm: An interleaved protein-language llm with protein-as-word pre-training , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  8. [8]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Large language and protein assistant for protein-protein interactions prediction , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  9. [9]

    Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    RAGPPI: Retrieval-Augmented Generation Benchmark for Protein--Protein Interactions in Drug Discovery , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  10. [10]

    Computational and Structural Biotechnology Journal , volume=

    Protein--protein interaction prediction with deep learning: A comprehensive review , author=. Computational and Structural Biotechnology Journal , volume=. 2022 , publisher=

  11. [11]

    Nature methods , volume=

    OmniPath: guidelines and gateway for literature-curated signaling pathway resources , author=. Nature methods , volume=. 2016 , publisher=

  12. [12]

    Nucleic acids research , volume=

    ConsensusPathDB—a database for integrating human functional interaction networks , author=. Nucleic acids research , volume=. 2009 , publisher=

  13. [13]

    Interdisciplinary Sciences: Computational Life Sciences , volume=

    A novel protein mapping method for predicting the protein interactions in COVID-19 disease by deep learning , author=. Interdisciplinary Sciences: Computational Life Sciences , volume=. 2021 , publisher=

  14. [14]

    Predict the Protein-protein Interaction between Virus and Host through Hybrid Deep Neural Network , year=

    Deng, Lei and Zhao, Jiaojiao and Zhang, Jingpu , booktitle=. Predict the Protein-protein Interaction between Virus and Host through Hybrid Deep Neural Network , year=

  15. [15]

    The Journal of Physical Chemistry B , volume=

    Residue-frustration-based prediction of protein--protein interactions using machine learning , author=. The Journal of Physical Chemistry B , volume=. 2022 , publisher=

  16. [16]

    Briefings in bioinformatics , volume=

    LSTM-PHV: prediction of human-virus protein--protein interactions by LSTM with word2vec , author=. Briefings in bioinformatics , volume=. 2021 , publisher=

  17. [17]

    Bioinformatics , volume=

    RAPPPID: towards generalizable protein interaction prediction with AWD-LSTM twin networks , author=. Bioinformatics , volume=. 2022 , publisher=

  18. [18]

    Analytical Biochemistry , volume=

    Protein-peptide binding residue prediction based on protein language models and cross-attention mechanism , author=. Analytical Biochemistry , volume=. 2024 , publisher=

  19. [19]

    Analytical biochemistry , volume=

    Improving protein-protein interaction prediction using protein language model and protein network features , author=. Analytical biochemistry , volume=. 2024 , publisher=

  20. [20]

    BMC genomics , volume=

    SDNN-PPI: self-attention with deep neural network effect on protein-protein interaction prediction , author=. BMC genomics , volume=. 2022 , publisher=

  21. [21]

    Computational and Structural Biotechnology Journal , volume=

    AttentionEP: Predicting essential proteins via fusion of multiscale features by attention mechanisms , author=. Computational and Structural Biotechnology Journal , volume=. 2024 , publisher=

  22. [22]

    Research , volume=

    A transformer-based ensemble framework for the prediction of protein--protein interaction sites , author=. Research , volume=. 2023 , publisher=

  23. [23]

    Briefings in Bioinformatics , volume=

    HN-PPISP: a hybrid network based on MLP-Mixer for protein--protein interaction site prediction , author=. Briefings in Bioinformatics , volume=. 2023 , publisher=

  24. [24]

    Nucleic acids research , volume=

    The IntAct molecular interaction database in 2012 , author=. Nucleic acids research , volume=. 2012 , publisher=

  25. [25]

    The NCBI handbook , volume=

    PubMed: the bibliographic database , author=. The NCBI handbook , volume=

  26. [26]

    Nucleic acids research , volume=

    UniProt: a worldwide hub of protein knowledge , author=. Nucleic acids research , volume=. 2019 , publisher=

  27. [27]

    Nucleic acids research , volume=

    3did: a catalog of domain-based interactions of known three-dimensional structure , author=. Nucleic acids research , volume=. 2014 , publisher=

  28. [28]

    Nucleic acids research , volume=

    Pfam: The protein families database in 2021 , author=. Nucleic acids research , volume=. 2021 , publisher=

  29. [29]

    Nucleic acids research , volume=

    STRING: a database of predicted functional associations between proteins , author=. Nucleic acids research , volume=. 2003 , publisher=

  30. [30]

    Nucleic acids research , volume=

    SIGNOR: a database of causal relationships between biological entities , author=. Nucleic acids research , volume=. 2016 , publisher=

  31. [31]

    Nucleic acids research , volume=

    Reactome: a database of reactions, pathways and biological processes , author=. Nucleic acids research , volume=. 2010 , publisher=

  32. [32]

    Nucleic acids research , volume=

    CORUM: the comprehensive resource of mammalian protein complexes—2019 , author=. Nucleic acids research , volume=. 2019 , publisher=

  33. [33]

    Nucleic acids research , volume=

    The complex portal-an encyclopaedia of macromolecular complexes , author=. Nucleic acids research , volume=. 2015 , publisher=

  34. [34]

    Nature biotechnology , volume=

    MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets , author=. Nature biotechnology , volume=. 2017 , publisher=

  35. [35]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini: a family of highly capable multimodal models , author=. arXiv preprint arXiv:2312.11805 , year=

  36. [36]

    2018 , publisher=

    Improving language understanding by generative pre-training , author=. 2018 , publisher=

  37. [37]

    Constitutional AI: Harmlessness from AI Feedback

    Constitutional ai: Harmlessness from ai feedback , author=. arXiv preprint arXiv:2212.08073 , year=

  38. [38]

    Trends in biochemical sciences , volume=

    Recent advances in predicting and modeling protein--protein interactions , author=. Trends in biochemical sciences , volume=. 2023 , publisher=

  39. [39]

    doi: 10.1126/science.ads0018

    Thomas Hayes and Roshan Rao and Halil Akin and Nicholas J. Sofroniew and Deniz Oktay and Zeming Lin and Robert Verkuil and Vincent Q. Tran and Jonathan Deaton and Marius Wiggert and Rohil Badkundri and Irhum Shafkat and Jun Gong and Alexander Derry and Raul S. Molina and Neil Thomas and Yousuf A. Khan and Chetan Mishra and Carolyn Kim and Liam J. Bartie a...

  40. [40]

    nature , volume=

    Highly accurate protein structure prediction with AlphaFold , author=. nature , volume=. 2021 , publisher=

  41. [41]

    Nature Communications , year=

    Learning the language of protein-protein interactions , author=. Nature Communications , year=

  42. [42]

    Qwen3-VL Technical Report

    Qwen3-vl technical report , author=. arXiv preprint arXiv:2511.21631 , year=

  43. [43]

    , author=

    Lora: Low-rank adaptation of large language models. , author=. Iclr , volume=

  44. [44]

    Briefings in Bioinformatics , volume =

    Bernett, Judith and Blumenthal, David B and List, Markus , title =. Briefings in Bioinformatics , volume =. 2024 , month =. doi:10.1093/bib/bbae076 , url =

  45. [45]

    International journal of proteomics , volume=

    Protein-protein interaction detection: methods and analysis , author=. International journal of proteomics , volume=. 2014 , publisher=

  46. [46]

    Chemical reviews , volume=

    Protein- protein interactions: interface structure, binding thermodynamics, and mutational analysis , author=. Chemical reviews , volume=. 1997 , publisher=

  47. [47]

    Bioinformatics , volume=

    Causal reasoning on biological networks: interpreting transcriptional changes , author=. Bioinformatics , volume=. 2012 , publisher=

  48. [48]

    biorxiv , pages=

    Protein complex prediction with AlphaFold-Multimer , author=. biorxiv , pages=. 2021 , publisher=

  49. [49]

    Qwen3 Technical Report

    Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=

  50. [50]

    Nature , volume=

    Accurate structure prediction of biomolecular interactions with AlphaFold 3 , author=. Nature , volume=. 2024 , publisher=