
arxiv: 2512.15133 · v2 · pith:P7I3FJOPnew · submitted 2025-12-17 · 💻 cs.CE · cs.AI

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

classification: cs.CE, cs.AI
keywords: protein, continuous, language, plms, diffusion, sequence-structure, structure, tokens

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state-of-the-art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one-tenth the budget for modality extension fine-tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.
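The abstract describes a unified absorbing diffusion process over discrete sequence tokens and continuous structure latents. The forward (masking) half of such a process can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `absorb` and `MASK_ID`, the use of `None` as the absorbing state for a structure latent, and the choice to mask the two modalities independently at each residue are all hypothetical. The denoising network, the categorical head, and the continuous diffusion head are not shown.

```python
import random

MASK_ID = -1  # hypothetical id for the absorbing [MASK] state of sequence tokens


def absorb(seq_tokens, struct_latents, t, rng):
    """Forward absorbing step: mask each token independently with probability t.

    seq_tokens:     list of int amino-acid ids (discrete modality)
    struct_latents: list of float vectors, one per residue (continuous modality)
    t:              masking probability in [0, 1]; t=0 keeps all, t=1 masks all
    rng:            a random.Random instance, passed in for reproducibility
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    if len(seq_tokens) != len(struct_latents):
        raise ValueError("one structure latent per residue is assumed")

    noisy_seq, noisy_struct, seq_mask, struct_mask = [], [], [], []
    for tok, lat in zip(seq_tokens, struct_latents):
        # Assumption: the two modalities are masked independently per residue.
        m_seq = rng.random() < t
        m_struct = rng.random() < t
        noisy_seq.append(MASK_ID if m_seq else tok)
        # None stands in for the absorbing state of a continuous token.
        noisy_struct.append(None if m_struct else list(lat))
        seq_mask.append(m_seq)
        struct_mask.append(m_struct)
    return noisy_seq, noisy_struct, seq_mask, struct_mask


if __name__ == "__main__":
    rng = random.Random(0)
    seq = [3, 7, 1, 12]
    lat = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    print(absorb(seq, lat, 0.5, rng))
```

In a setup like the one the abstract outlines, a model would then be trained to recover the masked positions: a categorical prediction for entries flagged in `seq_mask`, and a per-token continuous diffusion loss for entries flagged in `struct_mask`.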

