arxiv: 2510.01632 · v2 · submitted 2025-10-02 · 🧬 q-bio.BM · cs.AI

Recognition: unknown

BioBlobs: Unsupervised Discovery of Functional Substructures for Protein Function Prediction

keywords: protein · function · functional · blobs · catalytic · discovery · only

Protein function is driven by cohesive substructures, such as catalytic triads, binding pockets, and structural motifs, that occupy only a small fraction of a protein's residues. Yet existing pipelines built on protein encoders do not model proteins at the substructure level, leaving the central biological question unanswered: which substructure of a protein is responsible for its function? We introduce BioBlobs, an encoder-agnostic, end-to-end differentiable framework that compresses a protein into a small set of cohesive substructures (blobs) and predicts function from these blobs alone, so that each blob corresponds to a candidate functional region. Across diverse protein function prediction tasks and multiple sequence- and structure-based encoders, BioBlobs matches or exceeds strong baselines while operating on only a small fraction of residues. The discovered blobs adapt their spatial scale to the task, ranging from local catalytic sites to entire structural domains. Trained only on protein-level labels, BioBlobs recovers experimentally annotated catalytic sites in the M-CSA database, demonstrating unsupervised functional substructure discovery and opening a path to large-scale functional site discovery across the unannotated proteome.

This paper has not been read by Pith yet.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning

    cs.LG · 2026-05 · unverdicted · novelty 7.0

    SoftBlobGIN combines ESM-2 representations with protein contact graphs via a lightweight GNN and differentiable substructure pooling to achieve 92.8% accuracy on enzyme classification, raise binding-site AUROC to 0.98...