PROTOCOL: Late Interaction Retrieval for Protein Homolog Search

Gabrielle Cohn; Minh Hoang; Rohan Gumaste; Vihan Lakshman

arXiv: 2605.29158 · v1 · pith:KJMFR5RKnew · submitted 2026-05-27 · 💻 cs.LG · cs.IR · q-bio.BM

Pith reviewed 2026-06-29 13:15 UTC · model grok-4.3

classification 💻 cs.LG · cs.IR · q-bio.BM

keywords protein homology search · late interaction · residue embeddings · remote homology · protein language models · MaxSim · retrieval · twilight zone

The pith

Late interaction over per-residue embeddings improves remote protein homolog retrieval compared with pooled vectors or classical alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether scoring candidate proteins by the maximum similarity between their individual residue embeddings recovers remote homologs more effectively than collapsing each protein into a single vector. The approach encodes every protein independently with a language model, stores the residue sets, and applies a ColBERT-style MaxSim operation only at retrieval time. On SCOPe superfamily and Pfam clan benchmarks the method exceeds sequence-composition baselines, alignment tools, pooled language-model vectors, and trained single-vector retrievers. The result indicates that preserving local residue comparisons helps surface conserved motifs when global sequence identity is low.

Core claim

ProtoCol keeps proteins as unordered collections of residue embeddings from a protein language model. Candidates are encoded once and stored. Query-time scoring uses MaxSim, the sum of the maximum cosine similarity each query residue finds among a candidate's residues. On the SCOPe and Pfam remote-homology tasks this late-interaction layer produces higher retrieval accuracy than pooled-embedding or alignment baselines.
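The MaxSim score described above can be sketched in a few lines. This is a minimal illustration that assumes each protein is already available as a list of residue vectors; the function names and the toy 2-D vectors are hypothetical and do not come from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim(query_residues, candidate_residues):
    """Sum, over query residues, of the best cosine match in the candidate."""
    return sum(
        max(cosine(q, c) for c in candidate_residues)
        for q in query_residues
    )

# Toy 2-D "residue embeddings" (hypothetical values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
candidate_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a match for both query residues
candidate_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first query residue

score_a = maxsim(query, candidate_a)
score_b = maxsim(query, candidate_b)
print(score_a, score_b)
```

Because each query residue contributes only its single best match, the score is insensitive to residue order in the candidate, which is consistent with treating proteins as unordered collections.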

What carries the argument

MaxSim late interaction over independently encoded residue embeddings, which compares local patterns without first averaging them into a global vector.

If this is right

Candidate representations can be pre-computed once and reused for many queries without re-encoding.
The same residue-level storage supports retrieval at multiple levels of homology without retraining the underlying language model.
Gains concentrate in the regime where global sequence identity drops below the sensitivity of alignment methods.
The MaxSim scoring step itself has no learned parameters; per the Figure 2 caption, the trained component is a projection layer on top of the frozen ESM-2 backbone.
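The encode-once, query-many pattern in the points above might look like the following sketch. It assumes residue vectors have already been produced by some encoder; the `encoded` dictionary, the candidate names, and the helper functions are illustrative stand-ins rather than anything specified in the paper.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_store(encoded_proteins):
    """Unit-normalize every residue vector once, ahead of query time."""
    return {
        name: [normalize(r) for r in residues]
        for name, residues in encoded_proteins.items()
    }

def maxsim(query, candidate):
    """With unit vectors, the dot product equals cosine similarity."""
    return sum(
        max(sum(a * b for a, b in zip(q, c)) for c in candidate)
        for q in query
    )

def rank(query_residues, store):
    """Score every stored candidate against one query, best first."""
    query = [normalize(r) for r in query_residues]
    scored = [(maxsim(query, cand), name) for name, cand in store.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical stand-ins for encoder output (toy 2-D vectors).
encoded = {
    "cand_1": [[2.0, 0.0], [0.0, 3.0]],
    "cand_2": [[1.0, 1.0]],
    "cand_3": [[-1.0, 0.0], [0.0, -1.0]],
}
store = build_store(encoded)  # built once, reused for every query below

ranking_q1 = rank([[1.0, 0.0], [0.0, 1.0]], store)
ranking_q2 = rank([[-1.0, 0.0]], store)
print(ranking_q1, ranking_q2)
```

The store is never touched by a query, so adding a new query costs one encoding plus scoring. A production system would likely replace the exhaustive loop with an approximate index, which this sketch does not attempt.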

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same late-interaction pattern could be applied to other sequence families where local conservation matters more than global identity, such as regulatory DNA motifs.
If residue sets remain informative, the method could serve as an initial filter before more expensive structure-alignment searches.
Because encoding and storage are decoupled, the approach scales by adding more stored residue sets rather than retraining.

Load-bearing premise

That each query residue's best match among a candidate's residue embeddings will consistently flag conserved local motifs even when overall sequence similarity is weak.

What would settle it

If a simple average-pooling baseline that uses the identical residue embeddings achieves equal or higher accuracy than ProtoCol on the same SCOPe superfamily and Pfam clan test sets, the claimed benefit of late interaction would not hold.
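To make the proposed comparison concrete, the sketch below scores the same residue vectors two ways: cosine similarity of mean-pooled vectors, and MaxSim. The toy vectors are hand-picked (hypothetical, not benchmark data) so that the two scores disagree; this shows only that the rankings can differ, not which one is better on real proteins.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_pool(residues):
    """Average residue vectors into a single protein-level vector."""
    dim = len(residues[0])
    return [sum(r[i] for r in residues) / len(residues) for i in range(dim)]

def pooled_score(query, candidate):
    """Single-vector baseline: cosine of the two mean-pooled vectors."""
    return cosine(mean_pool(query), mean_pool(candidate))

def maxsim(query, candidate):
    """Late interaction: sum of each query residue's best cosine match."""
    return sum(max(cosine(q, c) for c in candidate) for q in query)

# Hand-picked toy vectors (an assumption made for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
homolog = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # shares both residues, plus one unrelated
decoy = [[1.0, 1.0]]                             # matches the average, but no single residue

pooled = {"homolog": pooled_score(query, homolog), "decoy": pooled_score(query, decoy)}
late = {"homolog": maxsim(query, homolog), "decoy": maxsim(query, decoy)}
print(pooled, late)
```

In this constructed case the pooled score prefers the decoy while MaxSim prefers the homolog. The settling experiment described above would run the same contrast on the actual SCOPe and Pfam test sets.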

Figures

Figures reproduced from arXiv: 2605.29158 by Gabrielle Cohn, Minh Hoang, Rohan Gumaste, Vihan Lakshman.

**Figure 1.** ColBERT attention maps for true positive pair. We visualize the residue-level similarity matrix between a representative query and its highest-ranked true positive match. Secondary structure annotations are shown along each axis. The similarity map exhibits block diagonal structure that coincides with secondary structure boundaries, indicating that PROTOCOL indeed learns meaningful structural organizatio…

**Figure 2.** Overview of the ProtoCoL framework for protein homolog retrieval. Variable-length query and candidate protein sequences are encoded with a frozen ESM-2 backbone, projected into residue-level embeddings, and compared using MaxSim scoring. The projection layer is trained with a symmetric contrastive objective, enabling retrieval using precomputed candidate representations …
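The Figure 2 caption mentions a symmetric contrastive objective but does not say which form is used. The sketch below assumes one common choice, a CLIP-style symmetric InfoNCE loss over a batch score matrix whose diagonal holds the true query-candidate pairs. The loss form, the `temperature` parameter, and the toy score matrices are assumptions made here, not details from the paper.

```python
import math

def log_softmax_row(row):
    """Numerically stable log-softmax of one list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_contrastive_loss(scores, temperature=1.0):
    """scores[i][j] is the score of query i against candidate j.

    The positive for query i is assumed to be candidate i. The loss
    averages the query-to-candidate and candidate-to-query cross-entropies.
    """
    n = len(scores)
    scaled = [[s / temperature for s in row] for row in scores]
    transposed = [list(col) for col in zip(*scaled)]
    q_to_c = -sum(log_softmax_row(scaled[i])[i] for i in range(n)) / n
    c_to_q = -sum(log_softmax_row(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (q_to_c + c_to_q)

# Toy batch score matrices (hypothetical values standing in for MaxSim scores).
aligned = [[2.0, 0.0], [0.0, 2.0]]   # true pairs score highest
shuffled = [[0.0, 2.0], [2.0, 0.0]]  # true pairs score lowest

loss_aligned = symmetric_contrastive_loss(aligned)
loss_shuffled = symmetric_contrastive_loss(shuffled)
print(loss_aligned, loss_shuffled)
```

Under this assumed form, minimizing the loss pushes each true pair's score above the other entries in its row and column. In a setup like the one the caption describes, only the projection layer would receive gradients.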

Original abstract

Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity. Protein language models provide context-aware representations that could improve alignment sensitivity in this regime. However, prior protein embedding-based retrieval pipelines often pool these representations into a single vector, potentially obscuring local motifs, domains, or conserved residues that reveal remote homology. We introduce ProtoCol, a model which represents proteins as sets of residue embeddings and uses ColBERT-style late interaction to test whether residue-level comparison improves homolog retrieval. ProtoCol encodes proteins independently, keeps candidate representations pre-computable, and scores candidates with MaxSim over residue embeddings. On SCOPe superfamily and Pfam clan benchmarks, ProtoCol outperforms sequence-composition, alignment-based, pooled PLM, and trained single-vector baselines, supporting late interaction as an effective retrieval layer for remote homology search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ProtoCol applies late interaction to protein residue embeddings for homolog search, but the text supplies no numbers, methods, or results to support the outperformance claim.

The main point is that this paper takes the ColBERT late-interaction approach and applies it to per-residue embeddings from protein language models for remote homolog retrieval. That specific combination is not in the cited prior work, so the technique itself is new.

It does a reasonable job laying out why global pooling can lose local conserved motifs in the twilight zone and why keeping residue-level comparisons might help. The framing around SCOPe superfamily and Pfam clan benchmarks is standard for the area.

The soft spot is the complete absence of any experimental substance. The abstract states that ProtoCol beats sequence-composition, alignment, pooled PLM, and single-vector baselines, yet gives no protein language model name, no indication of fine-tuning or zero-shot use, no MaxSim implementation details, no candidate-set construction, no numbers, no error bars, and no ablation. The stress-test note is accurate on this: the central claim rests entirely on unreported outcomes, so there is no way to check whether late interaction is responsible for any gains.

The work is not circular; it presents direct comparisons. But without the actual data or setup, the weakest assumption, that MaxSim over independent embeddings will reliably capture remote signals better than the baselines, remains untested.

This is for bioinformatics groups already working on embedding-based retrieval for function annotation or evolutionary analysis. A reader could extract the conceptual idea, but the lack of evidence means it is not ready for serious refereeing. I would not send it to peer review until the methods and results are added and can be evaluated.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces ProtoCol, a late-interaction retrieval model for protein homolog search. Proteins are encoded independently as sets of residue embeddings from protein language models; candidates are scored via ColBERT-style MaxSim over these embeddings. The central claim is that this outperforms sequence-composition, alignment-based, pooled-PLM, and trained single-vector baselines on SCOPe superfamily and Pfam clan benchmarks for remote homology detection in the twilight zone.

Significance. If the empirical superiority holds with proper controls and statistical validation, the work would establish late interaction as a practical retrieval layer that preserves local motif information better than global pooling while remaining pre-computable. This could meaningfully improve sensitivity for remote homolog detection, with downstream value for function annotation and evolutionary analysis.

major comments (1)

[Abstract] Abstract (and entire manuscript): the claim that 'ProtoCol outperforms sequence-composition, alignment-based, pooled PLM, and trained single-vector baselines' on SCOPe superfamily and Pfam clan benchmarks is asserted without any numbers, tables, figures, error bars, statistical tests, data-split descriptions, candidate-set construction details, PLM choice, fine-tuning status, or MaxSim implementation. This omission is load-bearing for the central empirical claim and prevents any assessment of whether late interaction, rather than baseline implementation or data artifacts, drives the reported gains.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comment correctly identifies that the abstract and manuscript require substantially more quantitative and methodological detail to support the central empirical claims. We will revise the manuscript to address this.

Referee: [Abstract] Abstract (and entire manuscript): the claim that 'ProtoCol outperforms sequence-composition, alignment-based, pooled PLM, and trained single-vector baselines' on SCOPe superfamily and Pfam clan benchmarks is asserted without any numbers, tables, figures, error bars, statistical tests, data-split descriptions, candidate-set construction details, PLM choice, fine-tuning status, or MaxSim implementation. This omission is load-bearing for the central empirical claim and prevents any assessment of whether late interaction, rather than baseline implementation or data artifacts, drives the reported gains.

Authors: We agree that the abstract as written provides only a qualitative summary and that the manuscript must make the supporting evidence fully transparent. The revised version will expand the abstract to report key performance numbers (with error bars and statistical tests where appropriate) for the SCOPe and Pfam benchmarks. We will also add or expand a methods subsection that explicitly describes: (i) the data splits and candidate-set construction, (ii) the specific PLM used and its fine-tuning status, and (iii) the precise MaxSim implementation. These additions will allow readers to evaluate whether the observed gains are attributable to late interaction rather than implementation details.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external benchmarks

The manuscript presents ProtoCol as an application of ColBERT-style late interaction to residue embeddings from protein language models, then reports direct empirical outperformance on SCOPe superfamily and Pfam clan benchmarks against sequence-composition, alignment-based, pooled PLM, and single-vector baselines. No equations, parameter-fitting steps, uniqueness theorems, or self-citations appear in the provided text that would reduce any claimed result to a redefinition or re-use of the paper's own inputs. The central claim is therefore an independent empirical comparison rather than a constructed prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that residue embeddings contain motif-level signal that MaxSim can exploit; this is a domain assumption with no independent evidence supplied in the abstract.

axioms (1)

domain assumption: Residue embeddings from protein language models preserve local motif information useful for remote homology detection.
Invoked as the justification for using per-residue rather than pooled representations.

pith-pipeline@v0.9.1-grok · 5698 in / 1127 out tokens · 45605 ms · 2026-06-29T13:15:20.368411+00:00 · methodology
