OmniMol: Transferring Particle Physics Knowledge to Molecular Dynamics with Point-Edge Transformers

Benjamin Nachman; Ibrahim Elsharkawy; Vinicius Mikuni; Wahid Bhimji

arXiv: 2601.10791 · v2 · submitted 2026-01-15 · physics.chem-ph · hep-ex · physics.data-an

Ibrahim Elsharkawy, Vinicius Mikuni, Wahid Bhimji, Benjamin Nachman

Pith reviewed 2026-05-16 13:15 UTC · model grok-4.3

keywords: machine-learned interatomic potentials · transfer learning · point-edge transformers · molecular dynamics · high-energy physics · foundation models · small molecules · attention bias

The pith

OmniMol adapts a particle-jet foundation model into a fast, accurate machine-learned interatomic potential for small molecules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a transformer pre-trained on one billion particle jets from high-energy physics can be fine-tuned into OmniMol, a state-of-the-art MLIP for small-molecule dynamics. The adaptation keeps the same Point-Edge-Transformer architecture and the interaction-matrix attention bias, which encodes pairwise distances or momenta directly in the attention logits. With this transfer the model reaches high accuracy on the oMol dataset after seeing relatively few fine-tuning examples, and it runs inference faster than typical alternatives. The central demonstration is that a network built for physics-carrying point clouds can move between sub-nuclear and atomic scales without being redesigned.

Core claim

OmniMol is obtained by taking Omnilearned, a PET pre-trained on diverse particle jets, and fine-tuning it on molecular data; the resulting model delivers excellent energy and force predictions on the oMol dataset even with limited fine-tuning examples, while the retained architecture produces uniquely fast inference.

What carries the argument

The interaction-matrix attention bias, which injects pairwise sub-nuclear or atomic physics directly into the transformer's attention logits to steer the network toward physically meaningful neighborhoods.
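
To make the mechanism concrete, here is a minimal NumPy sketch of single-head attention with an additive pairwise bias on the logits. It is an illustration under stated assumptions, not the paper's implementation: the function name `biased_attention` is hypothetical, and the bias used here (negative pairwise distance) is a hand-picked stand-in for the learned bias that the paper obtains from an MLP over pairwise features.

```python
import numpy as np

def biased_attention(q, k, v, bias):
    """Single-head attention with an additive pairwise bias on the logits.

    q, k, v: arrays of shape (n, d); bias: array of shape (n, n).
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + bias                   # bias enters before the softmax
    logits = logits - logits.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 5, 8
positions = rng.normal(size=(n, 3))
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))

# Assumed, hand-written bias: penalise distant pairs.
# The paper instead learns B_ij from pairwise physics features.
dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
bias = -dist

out, weights = biased_attention(q, k, v, bias)
```

Because the bias is added to the logits rather than used as a hard mask, every pair can still attend to every other pair; the bias only shifts how much weight each pair receives.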

If this is right

MLIPs for new molecular systems can be obtained with far fewer labeled examples than training from scratch.
Inference cost per atom remains low enough for long-time molecular-dynamics runs on commodity hardware.
Any point-cloud dataset whose elements carry pairwise physical quantities becomes a candidate for the same transfer recipe.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same bias mechanism could be applied to other point-cloud domains such as protein backbones or material defect clusters without re-deriving attention patterns.
If the transfer gap proves small across scales, foundation models trained on collider data may become a general source of priors for any classical many-body problem whose interactions are pairwise.

Load-bearing premise

The features and attention patterns learned from particle jets carry over to atomic interactions without substantial loss of physical fidelity or need for major architectural changes.

What would settle it

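A controlled experiment that trains an identical PET from random weights on the same oMol split and fine-tuning budget, then measures whether its accuracy falls short of the transferred OmniMol. Because the architecture is identical in both arms, inference speed would not differ, so the comparison isolates the accuracy gain from pre-training.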

Figures

Figures reproduced from arXiv: 2601.10791 by Benjamin Nachman, Ibrahim Elsharkawy, Vinicius Mikuni, Wahid Bhimji.

**Figure 2.**

**Figure 3.**

**Figure 4.** Conservative and Equivariant

**Figure 5.** Scaling behavior for (left) energy and (right) forces of

**Figure 6.** Scaling behavior for (left) energy and (right) forces of conservative and equivariant

**Figure 7.** Scaling behavior for energy and forces with respect to model size for

Original abstract

We present OmniMol, a state-of-the-art all-to-all transformer-based small molecule machine-learned interatomic potential (MLIP). OmniMol is built by adapting Omnilearned, a foundation model for particle jets found in high-energy physics (HEP) experiments such as at the Large Hadron Collider (LHC). Omnilearned is built with a Point-Edge-Transformer (PET) and pre-trained using a diverse set of one billion particle jets. It includes an interaction-matrix attention bias that injects pairwise sub-nuclear (HEP) or atomic (molecular-dynamics) physics directly into the transformer's attention logits, steering the network toward physically meaningful neighborhoods without sacrificing expressivity. We demonstrate OmniMol using the oMol dataset and find excellent performance even with relatively few examples for fine-tuning. Further, due to architectural transfer from Omnilearned, we demonstrate uniquely fast inference. This study lays the foundation for building interdisciplinary connections given datasets represented as collections of point clouds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Transferring a HEP-pretrained Point-Edge Transformer to molecular MLIPs is a reasonable cross-domain idea, but the abstract supplies no numbers, and no ablation separates pretraining gains from the architecture itself.

The main new element is adapting Omnilearned, a PET model pretrained on a billion particle jets, to small-molecule interatomic potentials on the oMol dataset while carrying over the interaction-matrix attention bias. That bias injects pairwise physics straight into the attention logits, which is a clean way to steer the network without losing expressivity. The paper does a solid job framing how this could support data-efficient fine-tuning and faster inference through architectural reuse across domains. The interdisciplinary angle is straightforward and worth exploring for anyone building foundation models that span physics scales.

The soft spots are clear and central. The abstract claims state-of-the-art performance and uniquely fast inference yet gives no metrics, baselines, error bars, or dataset statistics, so those assertions cannot be checked. More critically, there is no ablation comparing the transferred weights to an identical PET architecture trained from scratch or randomly initialized on oMol alone. Without that control, any observed improvement could come from the all-to-all design and bias term rather than the HEP pretraining. The domain shift between sub-nuclear jets and atomic interactions is also left unquantified in terms of embedding alignment or attention patterns.

This is for readers working on MLIPs and transfer learning who want to see whether large HEP datasets can usefully bootstrap chemistry models. A serious referee should see it so the full results, any controls, and the actual performance numbers can be evaluated; the idea is grounded enough to merit that step even if revisions are needed to tighten the evidence.

Referee Report

2 major / 2 minor

Summary. The paper presents OmniMol, a machine-learned interatomic potential for small molecules constructed by fine-tuning the Omnilearned Point-Edge-Transformer (PET) model that was pre-trained on one billion high-energy physics particle jets. It incorporates an interaction-matrix attention bias to inject pairwise physics and claims state-of-the-art performance on the oMol dataset even with limited fine-tuning examples, together with uniquely fast inference arising from the transferred architecture.

Significance. Successful cross-domain transfer from sub-nuclear jets to atomic interactions would be notable for foundation-model approaches in molecular dynamics, but the current manuscript supplies no quantitative metrics, baselines, error bars, or ablation controls, so the significance cannot yet be assessed.

major comments (2)

[Abstract] Abstract: the central claim of 'state-of-the-art' performance and 'excellent performance even with relatively few examples' is unsupported by any numerical results, dataset statistics, baseline comparisons, or error bars, rendering the primary assertion unevaluable.
[Results] The manuscript contains no ablation that compares OmniMol (pre-trained Omnilearned weights) against an identical PET architecture initialized randomly or trained from scratch on oMol alone; without this control the benefit of HEP pretraining versus the all-to-all PET design itself remains unisolated and is load-bearing for the transfer-learning thesis.

minor comments (2)

[Abstract] The oMol dataset is referenced without any description of its size, composition, or train/validation/test splits.
[Methods] Notation for the interaction-matrix attention bias is introduced but not defined with an explicit equation or pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the current manuscript requires additional quantitative support and controls to substantiate its claims, and we will revise accordingly.

Referee: [Abstract] Abstract: the central claim of 'state-of-the-art' performance and 'excellent performance even with relatively few examples' is unsupported by any numerical results, dataset statistics, baseline comparisons, or error bars, rendering the primary assertion unevaluable.

Authors: We acknowledge that the abstract's claims are not supported by numbers in the current version. In the revised manuscript we will add a concise results summary with specific metrics (e.g., energy and force MAEs on oMol), dataset statistics, direct baseline comparisons, and error bars from repeated runs so that the performance assertions become evaluable.
Revision: yes

Referee: [Results] The manuscript contains no ablation that compares OmniMol (pre-trained Omnilearned weights) against an identical PET architecture initialized randomly or trained from scratch on oMol alone; without this control the benefit of HEP pretraining versus the all-to-all PET design itself remains unisolated and is load-bearing for the transfer-learning thesis.

Authors: This is a valid criticism. We will add the requested ablation study in the revised paper: we will train an identical PET model from random initialization on oMol alone and report its performance alongside the fine-tuned OmniMol results, thereby isolating the contribution of the billion-jet pre-training.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results rest on empirical transfer evaluation

The paper's central claims concern measured performance of OmniMol on the oMol dataset after fine-tuning a pre-trained Omnilearned PET model originally trained on 1B HEP jets. The interaction-matrix attention bias is an explicit architectural design choice that encodes pairwise physics by construction, but the reported accuracy, data efficiency, and inference speed are obtained from downstream evaluation on held-out molecular data rather than from any equation or parameter that is defined in terms of the target results themselves. No derivation step reduces the final metrics to the inputs by algebraic identity, fitted-parameter renaming, or a self-citation chain whose validity depends on the present paper. The transfer benefit is therefore externally falsifiable and the derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that HEP-derived attention biases remain physically meaningful when applied to atomic interactions; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: The interaction-matrix attention bias developed for sub-nuclear physics can be directly reused for atomic pairwise interactions.
Invoked when the paper states that the bias injects pairwise physics into attention logits for both HEP and molecular cases.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

interaction-matrix attention bias that injects pairwise sub-nuclear (HEP) or atomic (molecular-dynamics) physics directly into the transformer’s attention logits
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero echoes


pairwise physical features f(ri,rj,...) = [ri-rj, ||ri-rj||, 1/||...||, 1/||...||², 1/||...||⁶, RBFs]
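
The feature list quoted above can be sketched in NumPy as follows. This is a hedged illustration, not the paper's code: the function name `pairwise_features`, the Gaussian form of the radial basis functions, and the parameters `n_rbf`, `r_max`, and `eps` are all assumptions, since the quoted passage names "RBFs" without specifying them. Self-pairs are given zero inverse-distance features here to avoid division by zero, which is also an assumed convention.

```python
import numpy as np

def pairwise_features(r, n_rbf=8, r_max=6.0, eps=1e-9):
    """Build per-pair features [r_i - r_j, d, 1/d, 1/d^2, 1/d^6, RBFs].

    r: (n, 3) positions. Returns an array of shape (n, n, 3 + 4 + n_rbf).
    n_rbf (>= 2), r_max, the Gaussian RBF form, and eps are illustrative assumptions.
    """
    diff = r[:, None, :] - r[None, :, :]                 # (n, n, 3) displacement vectors
    d = np.linalg.norm(diff, axis=-1, keepdims=True)     # (n, n, 1) pair distances
    off_diag = ~np.eye(len(r), dtype=bool)[..., None]    # mask out self-pairs
    inv = np.where(off_diag, 1.0 / np.maximum(d, eps), 0.0)
    centers = np.linspace(0.0, r_max, n_rbf)             # evenly spaced RBF centers
    width = centers[1] - centers[0]
    rbf = np.exp(-((d - centers) ** 2) / (2.0 * width ** 2))  # (n, n, n_rbf)
    return np.concatenate([diff, d, inv, inv**2, inv**6, rbf], axis=-1)

r = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
feats = pairwise_features(r)
print(feats.shape)  # (3, 3, 15)
```

In the paper's description a bias MLP then maps such per-pair features, together with atom-type embeddings, to the scalar bias B_ij; that learned step is not reproduced here.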

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Generative models on phase space
hep-ph · 2026-04 · unverdicted · novelty 8.0
Generative diffusion and flow models are constructed to remain exactly on the Lorentz-invariant massless N-particle phase space manifold during sampling for particle physics applications.

Application of a Mixture of Experts-based Foundation Model to the GlueX DIRC Detector
physics.data-an · 2026-04 · unverdicted · novelty 6.0
A single MoE-based foundation model with transformer backbone unifies simulation, PID, and noise filtering for the GlueX DIRC detector and matches or exceeds traditional geometrical and prior deep-learning methods acr...

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · cited by 2 Pith papers · 6 internal anchors

[1] the molecular encoders that embed molecules into $\vec{x}_{\mathrm{embed}} = \vec{x}^{\mathrm{pos}}_{\mathrm{embed}} + \vec{x}^{Z}_{\mathrm{embed}} + \vec{x}^{\mathrm{add}}_{\mathrm{embed}} + \vec{x}^{\mathrm{local}}_{\mathrm{embed}}$

[2] the bias MLP that transforms all pairwise physics priors into a transformer bias $f(\vec{r}_i, \vec{r}_j, \vec{x}^{Z}_{i,\mathrm{embed}}, \vec{x}^{Z}_{j,\mathrm{embed}}) \to B_{ij}$.
FIG. 5. Scaling behavior for (left) energy and (right) forces of OmniMol direct small and medium pre-trained and from scratch. Finetuning with ten and one hundred thousand molecules on OmniMol small proceeds with LoRA, 4 million a...

[3] the task heads that map the transformer representation to energy and force predictions.
a. Embedding Adapters. Finally, we introduce an "embedding adapting" layer. These are a per-token gated residual MLP placed in between the trained from scratch input encoders and the pre-trained transformers that modify learned embeddings $\vec{x}_{\mathrm{embed}}$ by: $\vec{x}^{*}_{\mathrm{embed}} = \vec{x}_{\mathrm{embed}} +$ tan...
[4] A. Radovic, M. Williams, D. Rousseau, M. Kagan, D. Bonacorsi, A. Himmel, A. Aurisano, K. Terao, and T. Wongjirad, Nature 560, 41 (2018)
[5] G. Karagiorgi, G. Kasieczka, S. Kravitz, B. Nachman, and D. Shih, Nature Reviews Physics 4, 399 (2022)
[6] O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, Nature Reviews Chemistry 4, 347 (2020)
[7] J. Behler, Chemical Reviews 121, 10037 (2021)
[8] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein...
[9] V. Mikuni and B. Nachman, Phys. Rev. D 111, L051504 (2025), arXiv:2404.16091 [hep-ph]
[10] V. Mikuni and B. Nachman, Phys. Rev. D 111, 054015 (2025), arXiv:2502.14652 [hep-ph]
[11] W. Bhimji, C. Harris, V. Mikuni, and B. Nachman, (2025), arXiv:2510.24066 [hep-ph]
[12] A. J. Larkoski, I. Moult, and B. Nachman, Phys. Rept. 841, 1 (2020), arXiv:1709.04464 [hep-ph]
[13] A. Butter et al., “The Machine Learning landscape of top taggers,” SciPost Phys. 7, 014 (2019), arXiv:1902.09914 [hep-ph]
[14] M. Feickert and B. Nachman, (2021), arXiv:2102.02770 [hep-ph]
[15] J. S. Smith, O. Isayev, and A. E. Roitberg, Chemical Science 8, 3192 (2017)
[16] K. Yao, J. E. Herr, D. Toth, R. McIntyre, and J. Parkhill, Chemical Science 9, 2261 (2018)
[17] J. Behler and M. Parrinello, Physical Review Letters 98, 146401 (2007)
[18] K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller, in Advances in Neural Information Processing Systems, Vol. 30 (2017)
[19] J. Gasteiger, F. Becker, and S. Günnemann, in Advances in Neural Information Processing Systems, Vol. 34 (2021) pp. 6790–6802
[20] S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, Nature Communications 13, 2453 (2022)
[21] I. Batatia, D. P. Kovács, G. N. C. Simm, C. Ortner, and G. Csányi, in Advances in Neural Information Processing Systems (2022)
[22] Y.-L. Liao, B. Wood, A. Das, and T. Smidt, arXiv preprint arXiv:2306.12059 (2024)
[23] E. Qu and A. S. Krishnapriyan, in Advances in Neural Information Processing Systems (2024) pp. 139030–139053
[24] X. Fu, B. M. Wood, L. Barroso-Luque, D. S. Levine, M. Gao, M. Dzamba, and C. L. Zitnick, arXiv preprint arXiv:2502.12147 (2025)
[25] N. Mardirossian and M. Head-Gordon, Molecular Physics 115, 2315 (2017)
[26] D. S. Levine, M. Shuaibi, E. W. C. Spotte-Smith, M. G. Taylor, M. R. Hasyim, K. Michel, I. Batatia, G. Csányi, M. Dzamba, P. Eastman, N. C. Frey, X. Fu, V. Gharakhanyan, A. S. Krishnapriyan, J. A. Rackers, S. Raja, A. Rizvi, A. S. Rosen, Z. Ulissi, S. Vargas, C. L. Zitnick, S. M. Blau, and B. M. Wood, “The open molecules 2025 (omol25) dataset, evaluations...
[27] M. McCabe, P. Mukhopadhyay, T. Marwah, B. R.-S. Blancard, F. Rozet, C. Diaconu, L. Meyer, K. W. K. Wong, H. Sotoudeh, A. Bietti, I. Espejo, R. Fear, S. Golkar, T. Hehir, K. Hirashima, G. Krawezik, F. Lanusse, R. Morel, R. Ohana, L. Parker, M. Pettee, J. Shen, K. Cho, M. Cranmer, and S. Ho, “Walrus: A cross-domain foundation model for continuum dynamics,...
[28] M. Herde, B. Raonić, T. Rohner, R. Käppeli, R. Molinaro, E. de Bézenac, and S. Mishra, “Poseidon: Efficient foundation models for PDEs,” in Advances in Neural Information Processing Systems, Vol. 37 (2024), arXiv:2405.19101 [cs.LG]
[29] M. McCabe, B. R.-S. Blancard, L. H. Parker, R. Ohana, M. Cranmer, A. Bietti, M. Eickenberg, S. Golkar, G. Krawezik, F. Lanusse, M. Pettee, T. Tesileanu, K. Cho, and S. Ho, in Advances in Neural Information Processing Systems, Vol. 37 (2024)
[30] Y. Liu, J. Sun, X. He, G. Pinney, Z. Zhang, and H. Schaeffer, arXiv preprint arXiv:2409.09811 (2024)
[31] F. Wiesner, M. Wessling, and S. Baek, “Towards a physics foundation model,” (2025), arXiv:2509.13805 [cs.LG]
[32] V. Mikuni, I. Elsharkawy, and B. Nachman, “Omnicosmos: Transferring particle physics knowledge across the cosmos,” (2025), arXiv:2512.24422 [astro-ph.CO]
[33] W. Bhimji, C. Harris, V. Mikuni, and B. Nachman, (2025), 10.48550/arXiv.2510.24066, arXiv:2510.24066 [hep-ph]
[34] X. Chen, C. Liang, D. Huang, E. Real, K. Wang, Y. Liu, H. Pham, X. Dong, T. Luong, C.-J. Hsieh, Y. Lu, and Q. V. Le, “Symbolic discovery of optimization algorithms,” (2023), arXiv:2302.06675 [cs.LG]
[35] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” (2019), arXiv:1711.05101 [cs.LG]
[36] L. N. Smith and N. Topin, “Super-convergence: Very fast training of neural networks using large learning rates,” (2018), arXiv:1708.07120 [cs.LG]
[37] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” (2021), arXiv:2106.09685 [cs.CL]
[38] G. Chen, F. Liu, Z. Meng, and S. Liang, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2022) arXiv:2202.07962 [cs.CL]
[39] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, et al., Nature Machine Intelligence 5, 220 (2023)
[40] T. Kreiman, Y. Bai, F. Atieh, E. Weaver, E. Qu, and A. S. Krishnapriyan, “Transformers discover molecular structure without graph priors,” (2025), arXiv:2510.02259 [cs.LG]
[41] A. A. Elhag, A. Raja, A. Morehead, S. M. Blau, G. M. Morris, and M. M. Bronstein, “Learning interatomic potentials without explicit equivariance,” (2025), arXiv:2510.00027 [cs.LG]
[42] R. S. Sutton, “The bitter lesson,” https://www.incompleteideas.net/IncIdeas/BitterLesson.html (2019), published March 13, 2019. A commonly used PDF mirror is https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf
[43] M. Yousefi and J. Collins, “Learning the bitter lesson: Empirical evidence from 20 years of cvpr proceedings,” (2024), also appears as EMNLP 2024 NLP4Science workshop paper (per arXiv comments), arXiv:2410.09649 [cs.CV]
[44] M. Wu, W. Wang, S. Liu, H. Yin, X. Wang, Y. Zhao, C. Lyu, L. Wang, W. Luo, and K. Zhang, “The bitter lesson learned from 2,000+ multilingual benchmarks,” (2025), arXiv:2504.15521 [cs.CL]
[45] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” (2020), arXiv:2010.11929 [cs.CV]
[46] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, in Proceedings of the 38th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 139 (2021)
[47] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” (2020), arXiv:2001.08361 [cs.LG]
[48] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre, “Training Compute-Optimal Large Language Models,” in Advances in Neural Information Processing Systems (NeurIPS) (2022) arXiv:220...
[49] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) arXiv:2106.04560 [cs.CV]

[1] [1]

themolecular encodersthat embed molecules into ⃗ xembed =⃗ xpos embed +⃗ xZ embed +⃗ xadd embed +⃗ xlocal embed

work page

[2] [2]

thebias MLPthat transforms all pair- wise physics priors into a transformer bias f(⃗ ri, ⃗ rj, ⃗ xZ i,embed, ⃗ xZ j,embed)→B ij. 6 FIG. 5. Scaling behavior for (left) energy and (right) forces ofOmniMoldirect small and medium pre-trained and from scratch. Finetuning with ten and one hundred thousand molecules onOmniMolsmall proceeds with LoRA, 4 million a...

work page

[3] [3]

embeddingadapting

thetask headsthat map the transformer represen- tation to energy and force predictions. a. Embedding AdaptersFinally, we introduce an "embeddingadapting"layer. Theseareaper-tokengated residual MLP placed in between the trained from scratch input encoders and the pre-trained transformers that modify learned embeddings⃗ xembed by: ⃗ x∗ embed =⃗ xembed + tan...

work page

[4] [4]

Radovic, M

A. Radovic, M. Williams, D. Rousseau, M. Kagan, D. Bonacorsi, A. Himmel, A. Aurisano, K. Terao, and T. Wongjirad, Nature560, 41 (2018)

work page 2018

[5] [5]

Karagiorgi, G

G. Karagiorgi, G. Kasieczka, S. Kravitz, B. Nachman, and D. Shih, Nature Reviews Physics4, 399 (2022)

work page 2022

[6] [6]

O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, Nature Reviews Chemistry4, 347 (2020)

work page 2020

[7] [7]

Behler, Chemical Reviews121, 10037 (2021)

J. Behler, Chemical Reviews121, 10037 (2021)

work page 2021

[8] [8]

Jumper, R

J. Jumper, R. Evans, A. Pritzel, T. Green, M. Fig- urnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein...

work page 2021

[9] [9]

Mikuni and B

V. Mikuni and B. Nachman, Phys. Rev. D111, L051504 (2025), arXiv:2404.16091 [hep-ph]

work page arXiv 2025

[10] [10]

Mikuni and B

V. Mikuni and B. Nachman, Phys. Rev. D111, 054015 (2025), arXiv:2502.14652 [hep-ph]

work page arXiv 2025

[11] [11]

Bhimji, C

W. Bhimji, C. Harris, V. Mikuni, and B. Nachman, (2025), arXiv:2510.24066 [hep-ph]

work page arXiv 2025

[12] [12]

A. J. Larkoski, I. Moult, and B. Nachman, Phys. Rept. 841, 1 (2020), arXiv:1709.04464 [hep-ph]

work page arXiv 2020

[13] [13]

Butter et al.,The Machine Learning landscape of top taggers,SciPost Phys.7 (2019) 014, [arXiv:1902.09914]

A. Butteret al., SciPost Phys.7, 014 (2019), arXiv:1902.09914 [hep-ph]

work page arXiv 2019

[14] [14]

Feickert and B

M. Feickert and B. Nachman, (2021), arXiv:2102.02770 [hep-ph]

work page arXiv 2021

[15] [15]

J. S. Smith, O. Isayev, and A. E. Roitberg, Chemical Science8, 3192 (2017)

work page 2017

[16] [16]

K.Yao, J.E.Herr, D.Toth, R.McIntyre, andJ.Parkhill, Chemical Science9, 2261 (2018)

work page 2018

[17] [17]

Behler and M

J. Behler and M. Parrinello, Physical Review Letters98, 146401 (2007)

work page 2007

[18] [18]

K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller, inAd- vances in Neural Information Processing Systems, Vol. 30 (2017)

work page 2017

[19] [19]

Gasteiger, F

J. Gasteiger, F. Becker, and S. Günnemann, inAdvances in Neural Information Processing Systems, Vol. 34 (2021) pp. 6790–6802. 9

work page 2021

[20] [20]

Batzner, A

S. Batzner, A. Musaelian, L. Sun, M. Geiger, J. P. Mailoa, M. Kornbluth, N. Molinari, T. E. Smidt, and B. Kozinsky, Nature Communications13, 2453 (2022)

work page 2022

[21] [21]

Batatia, D

I. Batatia, D. P. Kovács, G. N. C. Simm, C. Ortner, and G. Csányi, inAdvances in Neural Information Processing Systems(2022)

work page 2022

[22] [22]

Y.-L. Liao, B. Wood, A. Das, and T. Smidt, arXiv preprint arXiv:2306.12059 (2024)

work page arXiv 2024

[23] [23]

139030–139053

E.QuandA.S.Krishnapriyan,inAdvances in Neural In- formation Processing Systems(2024) pp. 139030–139053

work page 2024

[24] [24]

X. Fu, B. M. Wood, L. Barroso-Luque, D. S. Levine, M. Gao, M. Dzamba, and C. L. Zitnick, arXiv preprint arXiv:2502.12147 (2025)

work page arXiv 2025

[25] [25]

N.MardirossianandM.Head-Gordon,MolecularPhysics 115, 2315 (2017)

work page 2017

[26] [26]

Levine, Muhammed Shuaibi, Evan Walter Clark Spotte-Smith, Michael G

D. S. Levine, M. Shuaibi, E. W. C. Spotte-Smith, M. G. Taylor, M. R. Hasyim, K. Michel, I. Batatia, G. Csányi, M. Dzamba, P. Eastman, N. C. Frey, X. Fu, V. Gharakhanyan, A. S. Krishnapriyan, J. A. Rackers, S. Raja, A. Rizvi, A. S. Rosen, Z. Ulissi, S. Vargas, C. L. Zitnick, S. M. Blau, and B. M. Wood, “The open molecules 2025 (omol25) dataset, evaluations...

work page arXiv 2025

[27] [27]

Walrus: A cross-domain foundation model for continuum dynamics.arXiv preprint arXiv:2511.15684, 2025

M. McCabe, P. Mukhopadhyay, T. Marwah, B. R.-S. Blancard, F. Rozet, C. Diaconu, L. Meyer, K. W. K. Wong, H. Sotoudeh, A. Bietti, I. Espejo, R. Fear, S. Golkar, T. Hehir, K. Hirashima, G. Krawezik, F. Lanusse, R. Morel, R. Ohana, L. Parker, M. Pettee, J. Shen, K. Cho, M. Cranmer, and S. Ho, “Walrus: A cross-domain foundation model for continuum dynam- ics,...

work page arXiv 2025

[28] [28]

Poseidon: Efficient foundation models for PDEs

M. Herde, B. Raonić, T. Rohner, R. Käppeli, R. Moli- naro, E. de Bézenac, and S. Mishra, inAdvances in Neural Information Processing Systems, Vol. 37 (2024) arXiv:2405.19101 [cs.LG]

work page arXiv 2024

[29] [29]

McCabe, B

M. McCabe, B. R.-S. Blancard, L. H. Parker, R. Ohana, M. Cranmer, A. Bietti, M. Eickenberg, S. Golkar, G. Krawezik, F. Lanusse, M. Pettee, T. Tesileanu, K. Cho, and S. Ho, inAdvances in Neural Information Processing Systems, Vol. 37 (2024)

work page 2024

[30] [30]

Y. Liu, J. Sun, X. He, G. Pinney, Z. Zhang, and H. Scha- effer, arXiv preprint arXiv:2409.09811 (2024)

[31]

F. Wiesner, M. Wessling, and S. Baek, “Towards a physics foundation model,” (2025), arXiv:2509.13805 [cs.LG]

[32]

V. Mikuni, I. Elsharkawy, and B. Nachman, “Omnicosmos: Transferring particle physics knowledge across the cosmos,” (2025), arXiv:2512.24422 [astro-ph.CO]

[33]

W. Bhimji, C. Harris, V. Mikuni, and B. Nachman, (2025), 10.48550/arXiv.2510.24066, arXiv:2510.24066 [hep-ph]

[34]

X. Chen, C. Liang, D. Huang, E. Real, K. Wang, Y. Liu, H. Pham, X. Dong, T. Luong, C.-J. Hsieh, Y. Lu, and Q. V. Le, “Symbolic discovery of optimization algorithms,” (2023), arXiv:2302.06675 [cs.LG]

[35]

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” (2019), arXiv:1711.05101 [cs.LG]

[36]

L. N. Smith and N. Topin, “Super-convergence: Very fast training of neural networks using large learning rates,” (2018), arXiv:1708.07120 [cs.LG]

[37]

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” (2021), arXiv:2106.09685 [cs.CL]

[38]

G. Chen, F. Liu, Z. Meng, and S. Liang, in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2022) arXiv:2202.07962 [cs.CL]

[39]

N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, et al., Nature Machine Intelligence 5, 220 (2023)

[40]

T. Kreiman, Y. Bai, F. Atieh, E. Weaver, E. Qu, and A. S. Krishnapriyan, “Transformers discover molecular structure without graph priors,” (2025), arXiv:2510.02259 [cs.LG]

[41]

A. A. Elhag, A. Raja, A. Morehead, S. M. Blau, G. M. Morris, and M. M. Bronstein, “Learning interatomic potentials without explicit equivariance,” (2025), arXiv:2510.00027 [cs.LG]

[42]

R. S. Sutton, “The bitter lesson,” https://www.incompleteideas.net/IncIdeas/BitterLesson.html (2019), published March 13, 2019. A commonly used PDF mirror is https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf

[43]

M. Yousefi and J. Collins, “Learning the bitter lesson: Empirical evidence from 20 years of CVPR proceedings,” (2024), also appears as EMNLP 2024 NLP4Science workshop paper (per arXiv comments), arXiv:2410.09649 [cs.CV]

[44]

M. Wu, W. Wang, S. Liu, H. Yin, X. Wang, Y. Zhao, C. Lyu, L. Wang, W. Luo, and K. Zhang, “The bitter lesson learned from 2,000+ multilingual benchmarks,” (2025), arXiv:2504.15521 [cs.CL]

[45]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” (2020), arXiv:2010.11929 [cs.CV]

[46]

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, in Proceedings of the 38th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 139 (2021)

[47]

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” (2020), arXiv:2001.08361 [cs.LG]

[48]

J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre, “Training compute-optimal large language models,” in Advances in Neural Information Processing Systems (NeurIPS) (2022) arXiv:220...

[49]

X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) arXiv:2106.04560 [cs.CV]
