ENSEMBITS: an alphabet of protein conformational ensembles
Pith reviewed 2026-05-15 04:48 UTC · model grok-4.3
The pith
Ensembits turns protein conformational ensembles into a discrete token vocabulary.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ensembits is the first tokenizer of protein conformational ensembles. It is trained as a Residual VQ-VAE with a frame distillation objective on a large molecular dynamics corpus. The model derives informative geometric descriptors across conformations, encodes variable-size ensembles in a permutation-invariant manner, and produces tokens that support accurate prediction of per-residue fluctuations. The distillation objective enables dynamics tokens to be inferred from a single predicted structure. In evaluation, Ensembits outperforms prior methods on RMSF prediction, leads on a token-conditioned ANOVA test of motion amplitude, and matches or exceeds static tokenizers on enzyme commission (EC), gene ontology (GO), binding site/affinity, and zero-shot mutation-effect prediction, despite using far less pretraining data.
What carries the argument
Residual VQ-VAE equipped with a frame distillation objective that compresses ensemble dynamics into tokens recoverable from individual structures.
Load-bearing premise
The frame distillation objective can reliably extract dynamics tokens from a single predicted structure to compensate for limited ensemble data.
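The paper does not spell out the loss here, but one plausible form of such a frame distillation objective — a hypothetical sketch, not the authors' implementation — trains a single-frame "student" to reproduce the tokens the ensemble encoder assigns, with the teacher's targets held fixed:

```python
import numpy as np

def frame_distillation_loss(student_logits, teacher_tokens):
    """Softmax cross-entropy of a single-frame 'student' against the
    token assignments of an ensemble 'teacher' (targets held fixed)."""
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(teacher_tokens)), teacher_tokens].mean()

# Toy usage: 5 residues, a 16-token vocabulary (shapes are illustrative).
rng = np.random.default_rng(0)
student_logits = rng.normal(size=(5, 16))     # scores from one structure
teacher_tokens = rng.integers(0, 16, size=5)  # tokens from the full ensemble
loss = frame_distillation_loss(student_logits, teacher_tokens)
```

If the distillation works, the student's token distribution from one structure collapses onto the ensemble encoder's assignment, which is exactly what the single-structure inference claim requires.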
What would settle it
If Ensembits shows no improvement over static tokenizers when RMSF is predicted from single structures alone, the claimed benefit of the distillation objective would be refuted.
Original abstract
Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs capture only the local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and overcoming sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test of per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics tokens from a single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offers the discrete vocabulary needed to bring dynamics into protein language modeling and design.
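For reference, the RMSF benchmarked above is the per-residue root-mean-square fluctuation of an aligned ensemble. A minimal sketch, assuming frames are already superposed (e.g. via the Kabsch algorithm):

```python
import numpy as np

def per_residue_rmsf(coords):
    """coords: (n_frames, n_residues, 3) C-alpha positions, pre-aligned.
    Returns sqrt( mean_t || x_r(t) - <x_r> ||^2 ) for each residue r."""
    mean_structure = coords.mean(axis=0)          # (n_residues, 3)
    deviations = coords - mean_structure          # (n_frames, n_residues, 3)
    return np.sqrt((deviations ** 2).sum(axis=-1).mean(axis=0))

# Toy ensemble: residue 0 is frozen, residue 1 oscillates along x.
frames = np.zeros((4, 2, 3))
frames[:, 1, 0] = [0.0, 1.0, 0.0, -1.0]
rmsf = per_residue_rmsf(frames)
```

A frozen residue gets RMSF 0; the oscillating one gets sqrt(0.5). Predicting this per-residue profile from tokens is the headline regression task.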
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Ensembits, the first tokenizer for protein conformational ensembles. It uses a Residual VQ-VAE trained on a large molecular dynamics corpus with a frame distillation objective to address challenges in deriving geometric descriptors, handling variable-size ensembles, and overcoming dynamics data sparsity. The central claims are that Ensembits outperforms related methods on RMSF prediction and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test for per-residue motion amplitude; it also matches or exceeds static tokenizers on EC, GO, binding site/affinity, and zero-shot mutation-effect prediction tasks despite using far less pretraining data. The distillation objective is presented as enabling dynamics token prediction from a single predicted structure.
Significance. If the empirical claims hold after detailed validation, Ensembits would supply a discrete vocabulary for incorporating conformational dynamics into protein language modeling and design, filling a gap left by static structure tokenizers. The frame distillation approach to mitigate ensemble data sparsity represents a potentially useful technical contribution, though its impact must be demonstrated through controlled experiments rather than asserted.
major comments (3)
- [Abstract/Results] Abstract and Results: The claims of outperformance on RMSF prediction and superiority on the token-conditioned ANOVA test for per-residue motion amplitude are stated without any quantitative metrics, baseline comparisons, numerical values, error bars, or statistical details. These omissions make it impossible to assess the magnitude or reliability of the reported gains.
- [Methods] Methods: No ablation studies or control experiments are described that isolate the contribution of the frame distillation objective from the Residual VQ-VAE architecture itself. Without such controls, it remains unclear whether the objective successfully injects conformational variance information or simply re-encodes static geometry.
- [Methods/Results] Methods/Results: The manuscript does not specify the nature of the single-structure inputs used at inference time for dynamics token prediction (e.g., ensemble averages, AlphaFold models, or experimental structures). This detail is load-bearing for claims about alleviating dynamics data sparsity and for reproducibility.
minor comments (1)
- [Abstract] Abstract: The phrase 'token-conditioned ANOVA test' would benefit from a one-sentence clarification of the exact statistical procedure and how tokens condition the test.
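The procedure behind that phrase, as we read it: residues are grouped by their assigned token, and one asks how much of the variance in per-residue motion amplitude the grouping explains, i.e. a one-way ANOVA effect size (eta-squared). A minimal sketch under that reading, with toy arrays rather than the paper's data:

```python
import numpy as np

def eta_squared(tokens, amplitudes):
    """One-way ANOVA effect size (eta^2) with token identity as the
    grouping factor: between-group SS over total SS."""
    grand_mean = amplitudes.mean()
    ss_total = ((amplitudes - grand_mean) ** 2).sum()
    ss_between = 0.0
    for t in np.unique(tokens):
        group = amplitudes[tokens == t]
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
    return ss_between / ss_total

# Perfectly token-separated amplitudes give eta^2 = 1.
tokens = np.array([0, 0, 1, 1])
amps = np.array([1.0, 1.0, 3.0, 3.0])
effect = eta_squared(tokens, amps)
```

Under this reading, a tokenizer whose tokens carry no amplitude information scores near 0, and one whose tokens fully determine amplitude scores 1.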
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of the manuscript. We address each major point below and have revised the manuscript to incorporate the requested details and experiments.
read point-by-point responses
- Referee: [Abstract/Results] Abstract and Results: The claims of outperformance on RMSF prediction and superiority on the token-conditioned ANOVA test for per-residue motion amplitude are stated without any quantitative metrics, baseline comparisons, numerical values, error bars, or statistical details. These omissions make it impossible to assess the magnitude or reliability of the reported gains.
  Authors: We agree that quantitative support is essential. The revised manuscript now includes specific metrics in both the abstract and results: RMSF prediction MAE and Pearson correlation with error bars across baselines, plus F-statistics, p-values, and effect sizes from the token-conditioned ANOVA. These values were present in our internal analyses and are now explicitly reported with statistical details. Revision: yes.
- Referee: [Methods] Methods: No ablation studies or control experiments are described that isolate the contribution of the frame distillation objective from the Residual VQ-VAE architecture itself. Without such controls, it remains unclear whether the objective successfully injects conformational variance information or simply re-encodes static geometry.
  Authors: We acknowledge this gap. The revised Methods and Results sections now include ablation experiments: a Residual VQ-VAE trained without the frame distillation objective versus the full model. These controls show that the distillation objective improves capture of conformational variance (measured by higher variance in token usage across ensemble members) beyond static geometry re-encoding, with quantitative comparisons provided. Revision: yes.
- Referee: [Methods/Results] Methods/Results: The manuscript does not specify the nature of the single-structure inputs used at inference time for dynamics token prediction (e.g., ensemble averages, AlphaFold models, or experimental structures). This detail is load-bearing for claims about alleviating dynamics data sparsity and for reproducibility.
  Authors: We thank the referee for noting this omission. The single-structure inputs at inference are AlphaFold2 predictions (processed via the same frame extraction pipeline as training). This is now explicitly stated in the revised Methods section, along with preprocessing details, to support reproducibility and the sparsity-alleviation claim. Revision: yes.
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper presents a new Residual VQ-VAE tokenizer trained on an external MD corpus with a frame distillation objective. This objective is a training-time mechanism to enable single-structure inference and is not defined in terms of the reported downstream metrics (RMSF, EC/GO, mutation effects). No load-bearing self-citations or self-definitional reductions appear; the central claims rest on independent training data and external benchmarks rather than tautological re-encoding of fitted inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: molecular dynamics simulations produce ensembles that capture biologically relevant conformational dynamics.
Reference graph
Works this paper leans on
- [1] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: a generative model for music. arXiv:2005.00341.
- [2] Vladimir Gligorijević, P. Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Leman, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C. Taylor, Ian M. Fisk, Hera Vlamakis, Ramnik J. Xavier, Rob Knight, Kyunghyun Cho, and Richard Bonneau. Structure-based protein function prediction using graph convolutional networks. Nature Communications, 2021. doi:10.1038/s41467-021-23303-9.
- [3] Pengkang Guo, Bruno Correia, Pierre Vandergheynst, and Daniel Probst. Boosting protein graph representations through static-dynamic fusion. bioRxiv, 2025.
- [4] Thomas Hayes et al. Simulating 500 million years of evolution with a language model. Science. doi:10.1126/science.ads0018.
- [5] Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, and Burkhard Rost. ProstT5: bilingual language model for protein sequence and structure. bioRxiv, 2023. doi:10.1101/2023.07.23.550085.
- [6] Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver IO: a general architecture for structured inputs & outputs. arXiv:2107.14795.
- [7] Bowen Jing, Bonnie Berger, and Tommi Jaakkola. AlphaFold meets flow matching for generating protein ensembles. arXiv:2402.04845.
- [8] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 2021. doi:10.1038/s41586-021-03819-2.
- [9] W. Kabsch. A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 34(5):827–828, 1978. doi:10.1107/S0567739478001680.
- [10] Yogesh Kalakoti and Björn Wallner. AFsample2: predicting multiple conformations and ensembles with AlphaFold2. bioRxiv, 2024. doi:10.1101/2024.05.28.596195.
- [11] Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv:1412.6980.
- [12] Tim Kucera, Carlos Oliver, Dexiong Chen, and Karsten Borgwardt. ProteinShake: building datasets and benchmarks for deep learning on protein structures. Advances in Neural Information Processing Systems 36, pages 58277–58289, 2023.
- [13] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97. doi:10.1002/nav.3800020109.
- [14] Myeongsang Lee, Joseph W. Schafer, Jeshuwin Prabakaran, Devlina Chakravarty, Madeleine F. Clore, and Lauren L. Porter. Large-scale predictions of alternative protein conformations by AlphaFold2-based sequence association. Nature Communications, 2025. doi:10.1038/s41467-025-60759-5.
- [15] Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Adam E. Foster, Arne Schneuing, Jigyasa Nigam, Federico Barbero, et al. Scalable emulation of protein equilibrium ensembles with generative deep learning. Science. doi:10.1126/science.adv9817.
- [16] Xiaohan Lin, Zhenyu Chen, Yanheng Li, Xingyu Lu, Chuanliu Fan, Ziqiang Cao, Shihao Feng, Yi Qin Gao, and Jun Zhang. ProTokens: a machine-learned language for compact and informative encoding of protein 3D structures. bioRxiv, 2023. doi:10.1101/2023.11.27.568722.
- [17] Valentin Lombard, Sergei Grudinin, and Elodie Laine. PETIMOT: a novel framework for inferring protein motions from sparse data using SE(3)-equivariant graph neural networks. arXiv:2504.02839.
- [18] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. arXiv:1608.03983.
- [19] Jiarui Lu, Xiaoyin Chen, Stephen Zhewen Lu, Chence Shi, Hongyu Guo, Yoshua Bengio, and Jian Tang. Structure language models for protein conformation generation.
- [20] Antonio Mirarchi, Toni Giorgino, and Gianni De Fabritiis. mdCATH: a large-scale MD dataset for data-driven computational biophysics. Scientific Data, 11(1), November 2024. doi:10.1038/s41597-024-04140-z.
- [21] Pascal Notin, Aaron Kollasch, Daniel Ritter, Lood van Niekerk, Steffanie Paul, Han Spinner, Nathan Rollins, Ada Shaw, Rose Orenbuch, Ruben Weitzman, Jonathan Frazer, Mafalda Dias, Dinko Franceschi, Yarin Gal, and Debora Marks. ProteinGym: large-scale benchmarks for protein fitness prediction and design. Advances in Neural Information Processing Systems 36 (Datasets and Benchmarks), 2023.
- [22] Saro Passaro, Gabriele Corso, Jeremy Wohlwend, Mateo Reveiz, Stephan Thaler, Vignesh Ram Somnath, Noah Getz, Tally Portnoi, Julien Roy, Hannes Stark, David Kwabi-Addo, Dominique Beaini, Tommi Jaakkola, and Regina Barzilay. Boltz-2. bioRxiv, 2025. doi:10.1101/2025.06.14.659707.
- [23] Nicolas Portal, Wissam Karroucha, Vincent Mallet, and Massimiliano Bonomi. Learning dynamic protein representations at scale with distograms. bioRxiv, 2026. doi:10.64898/2026.01.29.702509.
- [24] Till Siebenmorgen, Filipe Menezes, Sabrina Benassou, Erinc Merdivan, Kieran Didi, André Santos Dias Mourão, Radosław Kitel, Pietro Liò, Stefan Kesselheim, Marie Piraud, Fabian J. Theis, Michael Sattler, and Grzegorz M. Popowicz. MISATO: machine learning dataset of protein–ligand complexes. Nature Computational Science, 2024. doi:10.1038/s43588-024-00627-2.
- [25] Michael Sun, Weize Yuan, Gang Liu, Wojciech Matusik, and Marinka Zitnik. Protein structure tokenization via geometric byte pair encoding.
- [26] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv:1711.00937.
- [27] Michel van Kempen, Stephanie S. Kim, Charlotte Tumescheit, Milot Mirdita, Cameron L. M. Gilchrist, Johannes Söding, and Martin Steinegger. Foldseek: fast and accurate protein structure search. bioRxiv, 2022.
- [28] A. Vasuki and Ponnusamy Thangapandian Vanathi. A review of vector quantization techniques. IEEE Potentials, 25:39–47, 2006.
- [29] Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. SoundStream: an end-to-end neural audio codec. arXiv:2107.03312.
- [30] Shuxin Zheng, Jiyan He, Chang Liu, Yu Shi, Ziheng Lu, Weitao Feng, Fusong Ju, Jiaxi Wang, Jianwei Zhu, Yaosen Min, He Zhang, Shidi Tang, Hongxia Hao, Peiran Jin, Chi Chen, Frank Noé, Haiguang Liu, and Tie-Yan Liu. Predicting equilibrium distributions for molecular systems with deep learning. Nature Machine Intelligence. doi:10.1038/s42256-024-00837-3.
Excerpts from the paper's appendices
A. Descriptor specification. "This appendix gives the full definition of the ENSEMBITS descriptors used as input to the RVQ-VAE. We describe two descriptor families sharing the same neighbour-selection machinery but differing in their per-neighbour feature …"
Backbone ψ dihedral (4D, optional). "The backbone ψ dihedral ψ_r at residue r is the torsion angle defined by the four consecutive backbone atoms (N_r, Cα_r, C_r, N_{r+1}), encoded as (sin ψ_r, cos ψ_r) to avoid the ±180° wrap. For a neighbour pair (i, j) we append a 4D block (sin ψ_i, cos ψ_i, sin ψ_j, cos ψ_j). Termini and residues for which the next residue's N is unavailable …"
SE(3) invariance. "The descriptor is SE(3)-invariant by construction: a global rigid-body motion of the entire structure left-multiplies every T_j^p by the same group element, which then cancels under the (T_r^p)^{-1} ∘ T_j^p composition. Real backbone N/Cα/C are required and are sourced per dataset (mdCATH h5 trajectories, MISATO MD); when only Cα is available, N and C are …"
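The cancellation in that excerpt is easy to verify numerically. A minimal sketch with toy 4×4 homogeneous frames (not the paper's descriptor pipeline): the relative transform (T_r)^{-1} ∘ T_j is unchanged when both frames are left-multiplied by the same global rigid-body motion.

```python
import numpy as np

def make_transform(rotation, translation):
    """4x4 homogeneous rigid-body transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def relative(T_r, T_j):
    # (T_r)^-1 ∘ T_j : the pose of frame j expressed in frame r
    return np.linalg.inv(T_r) @ T_j

def rotation_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two toy residue frames.
T_r = make_transform(rotation_z(0.3), np.array([1.0, 0.0, 2.0]))
T_j = make_transform(rotation_z(-1.1), np.array([0.5, 3.0, -1.0]))

# One global rigid-body motion G applied to both frames cancels:
# (G T_r)^-1 (G T_j) = T_r^-1 G^-1 G T_j = T_r^-1 T_j.
G = make_transform(rotation_z(2.0), np.array([-4.0, 0.2, 7.0]))
before = relative(T_r, T_j)
after = relative(G @ T_r, G @ T_j)
```

Applying G to only one of the two frames breaks the equality, which is why the descriptor must compose the inverse of the reference frame with each neighbour frame.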
B.1 RVQ-VAE. "RVQ-VAE [Zeghidour et al., 2021] is a multi-stage extension [Vasuki and Vanathi, 2006] in which a continuous embedding is approximated by a sum of K codebook entries, one drawn from each of K independently learned codebooks C_1, …, C_K, with associated nearest-neighbour quantization operators Q_ℓ(ρ) = argmin_{e ∈ C_ℓ} ‖ρ − e‖."
Gradient flow. "The straight-through estimator is applied to the summed quantized embedding q, so the encoder receives a single clean reconstruction gradient through the bottleneck while the codebooks themselves are updated separately (discussed below)." (Algorithm 1, Residual Vector Quantization: requires z, the latent produced by the set encoder, and codebooks C_ℓ with quantization …)
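The quantizer described in that excerpt can be sketched in a few lines. This is an illustrative inference-time pass only — random codebooks, no training, no straight-through machinery — assuming the K-level residual scheme above, where each level quantizes the residual left over by the previous levels:

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Residual vector quantization of a single latent z.

    codebooks : list of K arrays, each (L_k, d).
    Returns (tokens, q): one code index per level, and the reconstruction
    q = sum of the selected codebook entries."""
    residual = z.copy()
    tokens, q = [], np.zeros_like(z)
    for C in codebooks:
        idx = np.argmin(((residual[None, :] - C) ** 2).sum(axis=1))  # Q_l(rho)
        tokens.append(int(idx))
        q += C[idx]
        residual = z - q  # the next level sees what is still unexplained
    return tokens, q

rng = np.random.default_rng(0)
d = 8
codebooks = [rng.normal(size=(L, d)) for L in (2048, 128, 128)]  # paper's sizes
z = rng.normal(size=d)
tokens, q = residual_quantize(z, codebooks)
```

With learned (rather than random) codebooks, each successive level shrinks the residual, which is what makes the summed reconstruction progressively finer.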
Residual quantizer. K = 3 levels with codebook sizes [L_1, L_2, L_3] = [2048, 128, 128], giving an addressable token space of L_1 · L_2 · L_3 ≈ 3.4×10^7.
Codebook updates. Codebooks are updated by EMA with decay γ = 0.99. Codes whose EMA usage count drops below 1 are reseeded to a uniformly sampled encoder output from the current batch.
Decoder. A 3-layer MLP (n_dec = 3) with hidden size 256 and GELU activations, mapping the quantized latent ŷ ∈ R^128 to P = 10 descriptor vectors. Total trainable parameters: ≈ 3.4M.
Normalization and frame schedule. Descriptors are standardized to zero mean and unit variance per feature using statistics computed on the training split; the same (μ, σ) are bundled with the model checkpoint for downstream inference. During training, the number of input frames is sampled uniformly from p_eff ∼ U{1, …, 10} at each step (the variable-P schedule used by SFTD) …
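The EMA update and dead-code reseeding described above could look like the following. This is a hypothetical sketch with made-up shapes, not the released code; only γ = 0.99 and the usage-count threshold of 1 come from the excerpt.

```python
import numpy as np

def ema_codebook_update(counts, sums, batch_vecs, batch_tokens,
                        gamma=0.99, rng=None):
    """One EMA step for a single codebook level (hypothetical sketch).

    counts : (L,) EMA usage count per code
    sums   : (L, d) EMA sum of encoder outputs assigned to each code
    Returns the refreshed codebook plus updated (counts, sums).
    Codes whose EMA usage drops below 1 are reseeded from the batch."""
    rng = rng or np.random.default_rng(0)
    L = counts.shape[0]
    one_hot = np.zeros((len(batch_tokens), L))
    one_hot[np.arange(len(batch_tokens)), batch_tokens] = 1.0
    counts = gamma * counts + (1 - gamma) * one_hot.sum(axis=0)
    sums = gamma * sums + (1 - gamma) * one_hot.T @ batch_vecs
    codebook = sums / np.maximum(counts, 1e-8)[:, None]  # EMA cluster means
    dead = counts < 1.0  # usage threshold from the excerpt
    codebook[dead] = batch_vecs[rng.integers(0, len(batch_vecs), int(dead.sum()))]
    return codebook, counts, sums

# Toy batch: 3 encoder outputs hitting codes 0 and 1 of a 4-code book.
batch = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
assigned = np.array([0, 0, 1])
cb, cnt, sm = ema_codebook_update(np.ones(4), np.zeros((4, 2)), batch, assigned)
```

Reseeding from the batch is what keeps utilization high; the near-complete 2045/2048 utilization reported below is consistent with this mechanism working.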
[Figure: per-token exemplar panels (e.g. tokens 1039, 1865, 1815), each listing five example residues by PDB ID, residue index, and dz value across frames 0–4.]
Codebook utilization. "Two observations are notable. First, the codebook achieves near-complete utilization: 2045/2048 = 99.85% of codes are assigned at least one residue across the corpus, with no dead-code clusters; the three unused codes appear as isolated points scattered among the live ones, not as a connected region. This confirms that the EMA codebook update …"
Figure 7. t-SNE projection of the L_1 primary codebook (M = 2048 entries, d = 128) for the production tokenizer (combined mdCATH + MISATO training, k = 16, P = 10). Each point is one codebook entry; used codes are coloured by log10(1 + usage count). Utilization 2045/2048 = 99.9%, with 3 codes unused.
Comparison to existing structural tokenizers. "We run the same η² test using each baseline tokenizer's per-residue token assignment as the grouping variable, on the same ∼302k mdCATH-div residues (Table 5). ENSEMBITS has the strongest amplitude conditioning by a wide margin: η² = 0.371 against ≤ 0.128 for every other tokenizer. …"
Binding affinity. "The MLP head with random protein features hits the same ceiling as the MLP with ESM3struct, ProToken, Vote_3Di, or ENSEMBITS, because all of these protein representations are out-of-distribution on the CATH-H-disjoint test split and the MLP cannot extract a generalising signal from them. What remains is the MACCS-167 ligand fingerprint, which is in-distribution …"