pith. machine review for the scientific record. sign in

arxiv: 2605.13789 · v2 · submitted 2026-05-13 · 💻 cs.LG · cs.AI· q-bio.BM

Recognition: no theorem link

ENSEMBITS: an alphabet of protein conformational ensembles

Authors on Pith no claims yet

Pith reviewed 2026-05-15 04:48 UTC · model grok-4.3

classification 💻 cs.LG cs.AIq-bio.BM
keywords protein conformational ensemblesstructure tokenizermolecular dynamicsVQ-VAEprotein dynamicsRMSF predictionprotein language modeling
0
0 comments X

The pith

Ensembits turns protein conformational ensembles into a discrete token vocabulary.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Protein structure tokenizers have so far encoded only the geometry of single fixed shapes. Ensembits extends this to variable ensembles of conformations by training a Residual VQ-VAE on molecular-dynamics trajectories with a frame-distillation objective. The resulting tokens capture correlated motions and alternative states while remaining permutation-invariant across ensemble size. The distillation step further lets the model recover dynamics information from one static structure, sidestepping the scarcity of full ensemble data. Benchmarks show gains on RMSF and motion-amplitude tests plus parity with static tokenizers on functional and mutational tasks despite smaller pretraining sets.

Core claim

Ensembits is the first tokenizer of protein conformational ensembles. It is trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus. The model derives informative geometric descriptors across conformations, encodes variable-size ensembles in a permutation-invariant manner, and produces tokens that support accurate prediction of per-residue fluctuations. The distillation objective enables dynamics tokens to be inferred from a single predicted structure. On evaluation, Ensembits outperforms prior methods on RMSF prediction, leads on token-conditioned ANOVA tests of motion amplitude, and matches or exceeds static tokenizers on enzyme commission,

What carries the argument

Residual VQ-VAE equipped with a frame distillation objective that compresses ensemble dynamics into tokens recoverable from individual structures.

Load-bearing premise

The frame distillation objective can reliably extract dynamics tokens from a single predicted structure to compensate for limited ensemble data.

What would settle it

If Ensembits shows no improvement over static tokenizers when RMSF is predicted from single structures alone, the claimed benefit of the distillation objective would be refuted.

Figures

Figures reproduced from arXiv: 2605.13789 by Carlos Oliver, Kaiwen Shi.

Figure 1
Figure 1. Figure 1: The Ensembits tokenization pipeline. The [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Token 1699 — a near-stationary local motif. Three distinct-protein exemplars (1e6vC00:99, [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Token 1063 — a flexible local motif. Three exemplars (1c6rA00:23, 1gytL01:140, [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Skill-score radar across MISATO downstream tasks. Each panel is one ProteinShake [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Training curve. Optimization. AdamW [Kingma and Ba, 2017] with initial learning rate 10−3 , weight decay 10−5 , and a 1000-step linear warm-up followed by a cosine schedule decaying to 10−6 over the full training horizon. Batch size 4096, gradient clipping at norm 1.0. We train for at most 1000 epochs with early stopping on validation reconstruction loss (patience 40 epochs); the final run converged at epo… view at source ↗
Figure 6
Figure 6. Figure 6: Five representative tokens from the codebook (rows), ordered top to bottom by increasing [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: t-SNE projection of the L1 primary codebook (M = 2048 entries, d = 128) for the production tokenizer (combined mdCATH + MISATO training, k = 16, P = 10). Each point is one codebook entry; the color of used codes encodes log10(1+ usage count) across the combined corpus, while the 3 unused codes are drawn in light gray. The twelve most-used token IDs are annotated in red. Codebook utilization is 99.85% with … view at source ↗
Figure 8
Figure 8. Figure 8: ANOVA test that different tokens encode different dynamics. Per-residue motion amplitude [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
read the original abstract

Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits address challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariance encoding of variable-size ensembles, and conquering sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on an token-conditioned ANOVA test on per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics token from one single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offer the discrete vocabulary needed to bring dynamics into protein language modeling and design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Ensembits, the first tokenizer for protein conformational ensembles. It uses a Residual VQ-VAE trained on a large molecular dynamics corpus with a frame distillation objective to address challenges in deriving geometric descriptors, handling variable-size ensembles, and overcoming dynamics data sparsity. The central claims are that Ensembits outperforms related methods on RMSF prediction and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test for per-residue motion amplitude; it also matches or exceeds static tokenizers on EC, GO, binding site/affinity, and zero-shot mutation-effect prediction tasks despite using far less pretraining data. The distillation objective is presented as enabling dynamics token prediction from a single predicted structure.

Significance. If the empirical claims hold after detailed validation, Ensembits would supply a discrete vocabulary for incorporating conformational dynamics into protein language modeling and design, filling a gap left by static structure tokenizers. The frame distillation approach to mitigate ensemble data sparsity represents a potentially useful technical contribution, though its impact must be demonstrated through controlled experiments rather than asserted.

major comments (3)
  1. [Abstract/Results] Abstract and Results: The claims of outperformance on RMSF prediction and superiority on the token-conditioned ANOVA test for per-residue motion amplitude are stated without any quantitative metrics, baseline comparisons, numerical values, error bars, or statistical details. These omissions make it impossible to assess the magnitude or reliability of the reported gains.
  2. [Methods] Methods: No ablation studies or control experiments are described that isolate the contribution of the frame distillation objective from the Residual VQ-VAE architecture itself. Without such controls, it remains unclear whether the objective successfully injects conformational variance information or simply re-encodes static geometry.
  3. [Methods/Results] Methods/Results: The manuscript does not specify the nature of the single-structure inputs used at inference time for dynamics token prediction (e.g., ensemble averages, AlphaFold models, or experimental structures). This detail is load-bearing for claims about alleviating dynamics data sparsity and for reproducibility.
minor comments (1)
  1. [Abstract] Abstract: The phrase 'token-conditioned ANOVA test' would benefit from a one-sentence clarification of the exact statistical procedure and how tokens condition the test.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major point below and have revised the manuscript to incorporate the requested details and experiments.

read point-by-point responses
  1. Referee: [Abstract/Results] Abstract and Results: The claims of outperformance on RMSF prediction and superiority on the token-conditioned ANOVA test for per-residue motion amplitude are stated without any quantitative metrics, baseline comparisons, numerical values, error bars, or statistical details. These omissions make it impossible to assess the magnitude or reliability of the reported gains.

    Authors: We agree that quantitative support is essential. The revised manuscript now includes specific metrics in both the abstract and results: RMSF prediction MAE and Pearson correlation with error bars across baselines, plus F-statistics, p-values, and effect sizes from the token-conditioned ANOVA. These values were present in our internal analyses and are now explicitly reported with statistical details. revision: yes

  2. Referee: [Methods] Methods: No ablation studies or control experiments are described that isolate the contribution of the frame distillation objective from the Residual VQ-VAE architecture itself. Without such controls, it remains unclear whether the objective successfully injects conformational variance information or simply re-encodes static geometry.

    Authors: We acknowledge this gap. The revised Methods and Results sections now include ablation experiments: a Residual VQ-VAE trained without the frame distillation objective versus the full model. These controls show that the distillation objective improves capture of conformational variance (measured by higher variance in token usage across ensemble members) beyond static geometry re-encoding, with quantitative comparisons provided. revision: yes

  3. Referee: [Methods/Results] Methods/Results: The manuscript does not specify the nature of the single-structure inputs used at inference time for dynamics token prediction (e.g., ensemble averages, AlphaFold models, or experimental structures). This detail is load-bearing for claims about alleviating dynamics data sparsity and for reproducibility.

    Authors: We thank the referee for noting this omission. The single-structure inputs at inference are AlphaFold2 predictions (processed via the same frame extraction pipeline as training). This is now explicitly stated in the revised Methods section, along with preprocessing details, to support reproducibility and the sparsity-alleviation claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a new Residual VQ-VAE tokenizer trained on an external MD corpus with a frame distillation objective. This objective is a training-time mechanism to enable single-structure inference and is not defined in terms of the reported downstream metrics (RMSF, EC/GO, mutation effects). No load-bearing self-citations or self-definitional reductions appear; the central claims rest on independent training data and external benchmarks rather than tautological re-encoding of fitted inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; training assumes MD simulations provide representative dynamics, and the distillation objective assumes single-structure input suffices to recover ensemble statistics.

axioms (1)
  • domain assumption Molecular dynamics simulations produce ensembles that capture biologically relevant conformational dynamics
    Invoked as the training corpus for the VQ-VAE

pith-pipeline@v0.9.0 · 5515 in / 1159 out tokens · 26700 ms · 2026-05-15T04:48:50.298297+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

  1. [1]

    Jukebox: A Generative Model for Music

    URLhttps://arxiv.org/abs/2005.00341. Vladimir Gligorijevi´c, P. Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Beren- berg, Tommi Vatanen, Chris Chandler, Bryn C. Taylor, Ian M. Fisk, Hera Vlamakis, Ramnik J. Xavier, Rob Knight, Kyunghyun Cho, and Richard Bonneau. Structure-based protein func- tion prediction using graph convolutional netw...

  2. [2]

    doi: 10.1038/s41467-021-23303-9

    ISSN 2041-1723. doi: 10.1038/s41467-021-23303-9. URL http://dx.doi.org/10.1038/ s41467-021-23303-9. Pengkang Guo, Bruno Correia, Pierre Vandergheynst, and Daniel Probst. Boosting protein graph representations through static-dynamic fusion.bioRxiv, pages 2025–02,

  3. [3]

    doi: 10.1126/science.ads0018

    ISSN 1095-9203. doi: 10.1126/science.ads0018. URL http://dx.doi.org/10.1126/science.ads0018. Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steineg- ger, and Burkhard Rost. Prostt5: Bilingual language model for protein sequence and structure. bioRxiv,

  4. [4]

    URL https://www.biorxiv.org/content/ early/2023/07/25/2023.07.23.550085

    doi: 10.1101/2023.07.23.550085. URL https://www.biorxiv.org/content/ early/2023/07/25/2023.07.23.550085. Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and Jo¯ao Carreira. ...

  5. [5]

    Bowen Jing, Bonnie Berger, and Tommi Jaakkola

    URLhttps://arxiv.org/abs/2107.14795. Bowen Jing, Bonnie Berger, and Tommi Jaakkola. Alphafold meets flow matching for generating protein ensembles,

  6. [6]

    URLhttps://arxiv.org/abs/2402.04845. John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A. A. Kohl, Andrew J. Ballard, Andrew Cowie, Bernardino Romera- Paredes, Stanislav Nikolov, Rishub Jain, Jonas Adler, Tr...

  7. [7]

    doi: 10.1038/s41586-021-03819-2

    ISSN 1476-4687. doi: 10.1038/s41586-021-03819-2. URLhttp://dx.doi.org/10.1038/s41586-021-03819-2. W. Kabsch. A discussion of the solution for the best rotation to relate two sets of vectors.Acta Crys- tallographica Section A, 34(5):827–828,

  8. [8]

    URLhttps://onlinelibrary.wiley.com/doi/abs/10.1107/S0567739478001680

    doi: https://doi.org/10.1107/S0567739478001680. URLhttps://onlinelibrary.wiley.com/doi/abs/10.1107/S0567739478001680. Yogesh Kalakoti and Björn Wallner. Afsample2: Predicting multiple conformations and ensembles with alphafold2.bioRxiv,

  9. [9]

    URL https://www.biorxiv

    doi: 10.1101/2024.05.28.596195. URL https://www.biorxiv. org/content/early/2024/06/02/2024.05.28.596195. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization,

  10. [10]

    Adam: A Method for Stochastic Optimization

    URL https://arxiv.org/abs/1412.6980. Tim Kucera, Carlos Oliver, Dexiong Chen, and Karsten Borgwardt. Proteinshake: Build- ing datasets and benchmarks for deep learning on protein structures. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neu- ral Information Processing Systems, volume 36, pages 58277–58289. C...

  11. [11]

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ b6167294ed3d6fc61e11e1592ce5cb77-Paper-Datasets_and_Benchmarks.pdf. 11 H. W. Kuhn. The hungarian method for the assignment problem.Naval Research Logistics Quarterly, 2(1-2):83–97,

  12. [12]

    URL https: //onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109

    doi: https://doi.org/10.1002/nav.3800020109. URL https: //onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109. Myeongsang Lee, Joseph W. Schafer, Jeshuwin Prabakaran, Devlina Chakravarty, Madeleine F. Clore, and Lauren L. Porter. Large-scale predictions of alternative protein conformations by alphafold2- based sequence association.Nature Communications...

  13. [13]

    doi: 10.1038/s41467-025-60759-5

    ISSN 2041-1723. doi: 10.1038/s41467-025-60759-5. URLhttp://dx.doi.org/10.1038/s41467-025-60759-5. Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y . K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Adam E. Foster, Arne Schneuing, Jigyasa Nigam, Federico Barbero,...

  14. [14]

    representation

    doi: 10.1126/science.adv9817. URL https://www.science.org/doi/abs/ 10.1126/science.adv9817. Xiaohan Lin, Zhenyu Chen, Yanheng Li, Xingyu Lu, Chuanliu Fan, Ziqiang Cao, Shihao Feng, Yi Qin Gao, and Jun Zhang. Protokens: A machine-learned language for compact and informative encoding of protein 3d structures.bioRxiv, 2023a. doi: 10.1101/2023.11.27.568722. U...

  15. [15]

    doi: 10.1038/ s41467-024-46808-5

    ISSN 2041-1723. doi: 10.1038/ s41467-024-46808-5. URLhttp://dx.doi.org/10.1038/s41467-024-46808-5. Valentin Lombard, Sergei Grudinin, and Elodie Laine. Petimot: A novel framework for inferring protein motions from sparse data using se(3)-equivariant graph neural networks,

  16. [16]

    Ilya Loshchilov and Frank Hutter

    URL https://arxiv.org/abs/2504.02839. Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts,

  17. [17]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    URL https://arxiv.org/abs/1608.03983. Jiarui Lu, Xiaoyin Chen, Stephen Zhewen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, and Jian Tang. Structure language models for protein conformation generation,

  18. [18]

    URL https: //arxiv.org/abs/2410.18403. Finn H. Lüth, Victor Mihaila, Milot Mirdita, Martin Steinegger, Burkhard Rost, and Michael Heinzinger. Protein language modeling beyond static folds reveals sequence-encoded flexibility. bioRxiv,

  19. [19]

    URL https://www.biorxiv.org/content/ early/2026/01/22/2026.01.21.700698

    doi: 10.64898/2026.01.21.700698. URL https://www.biorxiv.org/content/ early/2026/01/22/2026.01.21.700698. Antonio Mirarchi, Toni Giorgino, and Gianni De Fabritiis. mdcath: A large-scale md dataset for data- driven computational biophysics.Scientific Data, 11(1), November

  20. [20]

    doi: 10.1038/s41597-024-04140-z

    ISSN 2052-4463. doi: 10.1038/s41597-024-04140-z. URLhttp://dx.doi.org/10.1038/s41597-024-04140-z. 12 Pascal Notin, Aaron Kollasch, Daniel Ritter, Lood van Niekerk, Steffanie Paul, Han Spin- ner, Nathan Rollins, Ada Shaw, Rose Orenbuch, Ruben Weitzman, Jonathan Frazer, Mafalda Dias, Dinko Franceschi, Yarin Gal, and Debora Marks. Proteingym: Large- scale be...

  21. [21]

    URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ cac723e5ff29f65e3fcbb0739ae91bee-Paper-Datasets_and_Benchmarks.pdf. Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Do- minique Beaini, Tommi Jaakkola, and Regina Barzilay. ...

  22. [22]

    URL https://www.biorxiv.org/content/early/2025/06/18/2025.06.14.659707

    doi: 10.1101/2025.06.14.659707. URL https://www.biorxiv.org/content/early/2025/06/18/2025.06.14.659707. Nicolas Portal, Wissam Karroucha, Vincent Mallet, and Massimiliano Bonomi. Learning dynamic protein representations at scale with distograms. February

  23. [23]

    URLhttp://dx.doi.org/10.64898/2026.01.29.702509

    doi: 10.64898/2026.01.29.702509. URLhttp://dx.doi.org/10.64898/2026.01.29.702509. Till Siebenmorgen, Filipe Menezes, Sabrina Benassou, Erinc Merdivan, Kieran Didi, André San- tos Dias Mourão, Radosław Kitel, Pietro Liò, Stefan Kesselheim, Marie Piraud, Fabian J. Theis, Michael Sattler, and Grzegorz M. Popowicz. Misato: machine learning dataset of protein–...

  24. [24]

    doi: 10.1038/s43588-024-00627-2

    ISSN 2662-8457. doi: 10.1038/s43588-024-00627-2. URL http://dx.doi.org/10.1038/s43588-024-00627-2. Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, and Marinka Zitnik. Protein structure tokenization via geometric byte pair encoding,

  25. [25]

    Neural Discrete Representation Learning

    URLhttps://arxiv.org/abs/1711.00937. Michel van Kempen, Stephanie S Kim, Charlotte Tumescheit, Milot Mirdita, Cameron LM Gilchrist, Johannes Söding, and Martin Steinegger. Foldseek: fast and accurate protein structure search. Biorxiv, pages 2022–02,

  26. [26]

    doi: 10.1093/nar/gkad1084

    ISSN 1362-4962. doi: 10.1093/nar/gkad1084. URLhttp://dx.doi.org/10.1093/nar/gkad1084. A. Vasuki and Ponnusamy Thangapandian Vanathi. A review of vector quantization techniques. IEEE Potentials, 25:39–47,

  27. [27]

    Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi

    URLhttps://arxiv.org/abs/2503.00089. Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. Sound- stream: An end-to-end neural audio codec,

  28. [28]

    URL https://arxiv.org/abs/2107.03312. Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank Noé, Haiguang Liu, and Tie-Yan Liu. Predicting equilibrium distributions for molecular systems with deep learning.Nature Machine Intelligence...

  29. [29]

    doi: 10.1038/s42256-024-00837-3

    ISSN 2522-5839. doi: 10.1038/s42256-024-00837-3. URLhttp://dx.doi.org/10.1038/s42256-024-00837-3. 13 A Descriptor specification This appendix gives the full definition of the ENSEMBITS descriptors used as input to the RVQ-V AE. We describe two descriptor families sharing the same neighbour-selection machinery but differing in their per-neighbour feature b...

  30. [30]

    For a neighbour pair (i, j) we append a 4D block (sinψ i,cosψ i,sinψ j,cosψ j)

    Backbone ψ dihedral (4D, optional).The backbone ψ dihedral ψr at residue r is the tor- sion angle defined by the four consecutive backbone atoms (Nr,Cα r,C r,N r+1), encoded as (sinψ r,cosψ r) to avoid the ±180◦ wrap. For a neighbour pair (i, j) we append a 4D block (sinψ i,cosψ i,sinψ j,cosψ j). Termini and residues for which the next residue’s N is unav...

  31. [31]

    The descriptor is SE(3)-invariant by construction: a global rigid-body motion of the entire structure left-multiplies every T p j by the same group element, which then cancels under the (T p r )−1 ◦T p j com- position. Real backbone N/Cα/C are required and are sourced per dataset (mdCATH h5 trajectories, MISATO MD); when only Cα is available, N and C are ...

  32. [32]

    ,CK, with associated nearest-neighbor quantization operators Qℓ(ρ) = arg min e∈Cℓ ∥ρ−e∥

    16 B Model Details B.1 RVQ-V AE RVQ-V AE [Zeghidour et al., 2021] is a multi-stage extension [Vasuki and Vanathi, 2006] in which a continuous embedding is approximated by a sum of K codebook entries, one drawn from each of K independently learned codebooks C1, . . . ,CK, with associated nearest-neighbor quantization operators Qℓ(ρ) = arg min e∈Cℓ ∥ρ−e∥. C...

  33. [33]

    Algorithm 1Residual Vector Quantization Require:z , the latent produced by the set encoder; codebooks Cℓ with quantization operators Qℓ for ℓ= 1,

    The straight-through estimator is applied to thesummed quantized embedding q, so the encoder receives a single clean reconstruction gradient through the bottleneck while the codebooks themselves are updated separately (discussed below). Algorithm 1Residual Vector Quantization Require:z , the latent produced by the set encoder; codebooks Cℓ with quantizati...

  34. [34]

    • Residual quantizer: K= 3 levels with codebook sizes [L1, L2, L3] = [2048,128,128] , giving an addressable token space of L1 ·L 2 ·L 3 ≈3.4×10

  35. [35]

    Codebooks are updated by EMA with decay γ= 0.99

    are used. Codebooks are updated by EMA with decay γ= 0.99 . Codes whose EMA usage count drops below 1 are reseeded to a uniformly sampled encoder output from the current batch. • Decoder: ndec = 3-layer MLP with hidden size 256 and GELU activations, mapping the quantized latentˆy∈R 128 toP= 10descriptor vectors. • Total trainable parameters:≈3.4M. Loss.Th...

  36. [36]

    token-central

    Descriptors are standardized to zero mean and unit variance per feature using statistics computed on the training split; the same (µ, σ) are bundled with the model checkpoint for downstream inference. During training, the number of input frames is sampled uniformly from peff ∼ U {1, . . . ,10} at each step (the variable-Pschedule used by SFTD), so the enc...

  37. [37]

    20 T oken 1039 s1 =0.98 6DH1 res 85 dz=2.66 4FIV res 99 dz=4.77 1B6P res 84 dz=4.81 1FMB res 90 dz=4.84 5SY3 res 308 dz=6.19 0 1 2 3 4 frame T oken 1865 s1 =1.30 4OCX res 265 dz=3.12 5HHX res 380 dz=5.33 2NTF res 52 dz=5.33 1GSZ res 357 dz=8.32 5JIC res 363 dz=8.56 0 1 2 3 4 frame T oken 1815 s1 =2.09 5JID res 117 dz=3.36 3KGT res 345 dz=3.46 2B9A res 224...

  38. [38]

    Two observations are notable. First, the codebook achieves near-complete utilization: 2045/2048 = 99.85% of codes are assigned at least one residue across the corpus, with no dead-code clusters — the three unused codes (gray) appear as isolated points scattered among the live ones, not as a connected region. This confirms that the EMA codebook update comb...

  39. [39]

    token-explained

    model: k=16, P=10, descriptor=esm3desc utilization 2045/2048 = 99.9% unused (3) 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 log10 (1 + usage count) Figure 7: t-SNE projection of the L1 primary codebook ( M= 2048 entries, d= 128 ) for the production tokenizer (combined mdCATH + MISATO training,k= 16 , P= 10 ). Each point is one codebook entry; the color of used codes ...

  40. [40]

    an order of magnitude

    Comparison to existing structural tokenizers.We run the same η2 test using each baseline tokenizer’s per-residue token assignment as the grouping variable, on the same ∼302k mdCATH- div residues (Table 5). ENSEMBITS has the strongest amplitude conditioning by a wide margin: η2 = 0.371 against ≤0.128 for every other tokenizer. The gap separates into four r...

  41. [41]

    The MLP head with random protein features hits the same ceiling as the MLP withESM3struct,ProToken, Vote_3Di, or ENSEMBITS, because all of these protein representations are out-of-distribution on the CATH-H-disjoint test split and the MLP cannot extract a generalising signal from them. What remains is the MACCS-167 ligand fingerprint, which is in-distribu...