Scaling Laws and Symmetry, Evidence from Neural Force Fields
Pith reviewed 2026-05-18 08:04 UTC · model grok-4.3
The pith
Equivariant architectures for interatomic potentials follow better power-law scaling than non-equivariant models, with higher-order representations improving the exponents further.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Equivariant architectures that leverage task symmetry scale better than non-equivariant models in learning interatomic potentials, with higher-order representations translating to better scaling exponents. The study observes clear power-law scaling with respect to data, parameters, and compute, where the exponents are architecture-dependent. Analysis also suggests that for compute-optimal training, data and model sizes should scale in tandem regardless of the architecture.
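To make "architecture-dependent exponents" concrete, here is a minimal sketch of how a scaling exponent can be read off a set of (scale, loss) measurements. It assumes a pure power law, loss ≈ a · x^(-alpha), with no irreducible-loss offset, and fits it by least squares in log-log space. The function name `fit_power_law` and all numbers are illustrative assumptions, not the paper's fitting procedure or results.

```python
import numpy as np

def fit_power_law(x, loss):
    """Fit loss ~= a * x**(-alpha) by linear least squares in log-log space."""
    log_x = np.log(np.asarray(x, dtype=float))
    log_loss = np.log(np.asarray(loss, dtype=float))
    slope, intercept = np.polyfit(log_x, log_loss, 1)
    return np.exp(intercept), -slope  # (a, alpha)

# Synthetic, noise-free curves for illustration only (not data from the paper).
sizes = np.array([1e3, 1e4, 1e5, 1e6])
loss_equivariant = 5.0 * sizes ** -0.40   # hypothetical steeper curve
loss_plain = 5.0 * sizes ** -0.25         # hypothetical shallower curve

a_eq, alpha_eq = fit_power_law(sizes, loss_equivariant)
a_pl, alpha_pl = fit_power_law(sizes, loss_plain)
print(f"equivariant alpha = {alpha_eq:.2f}, non-equivariant alpha = {alpha_pl:.2f}")
```

A larger fitted `alpha` means the loss falls faster per decade of scale, which is the sense in which one architecture "scales better" than another.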
What carries the argument
Architecture-dependent power-law scaling exponents arising from equivariant versus non-equivariant neural network designs on geometric force prediction tasks.
If this is right
- Equivariant models reach target accuracy with less data or compute at large scales.
- Higher-order representations within equivariant models yield additional improvements in scaling efficiency.
- Data volume and model capacity must increase together for optimal performance independent of architecture choice.
- Explicit incorporation of symmetry reduces the effective learning difficulty compared to discovering it from data alone.
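The first consequence can be worked through under a simplifying assumption: each architecture follows a pure power law in dataset size, with no irreducible-loss term. The symbols below (a_k, alpha_k, L^*, D_k^*) are introduced here for illustration and are not taken from the paper.

```latex
% Assumed form for architecture k (eq = equivariant, non = non-equivariant):
L_k(D) = a_k \, D^{-\alpha_k}, \qquad \alpha_{\mathrm{eq}} > \alpha_{\mathrm{non}} > 0 .

% Data needed to reach a target loss L^*:
D_k^* = \left( \frac{a_k}{L^*} \right)^{1/\alpha_k} .

% Ratio of data requirements:
\frac{D_{\mathrm{non}}^*}{D_{\mathrm{eq}}^*}
  = \frac{a_{\mathrm{non}}^{1/\alpha_{\mathrm{non}}}}{a_{\mathrm{eq}}^{1/\alpha_{\mathrm{eq}}}}
    \left( \frac{1}{L^*} \right)^{\frac{1}{\alpha_{\mathrm{non}}} - \frac{1}{\alpha_{\mathrm{eq}}}} .
```

Because the exponent on 1/L^* is positive, the ratio grows without bound as the target loss shrinks. Under this assumed form, a gap in exponents therefore matters more at larger scales, whereas a gap only in the prefactors a_k would give a constant ratio.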
Where Pith is reading between the lines
- The advantage may generalize to other geometric or physics-based prediction tasks where symmetry is known a priori.
- Extending experiments to much larger model regimes could test whether the exponent gap persists or narrows.
- Designers of future large-scale models for symmetric domains may benefit from prioritizing built-in equivariance over purely data-driven approaches.
Load-bearing premise
The observed power-law scaling behaviors and differences in exponents between architectures will hold outside the specific datasets, model sizes, and training regimes tested.
What would settle it
Training a large non-equivariant model on substantially more data and compute until its effective scaling exponent matches or exceeds that of an equivariant counterpart would challenge the central claim.
Original abstract
We present an empirical study in the geometric task of learning interatomic potentials, which shows equivariance matters even more at larger scales; we show a clear power-law scaling behaviour with respect to data, parameters and compute with "architecture-dependent exponents". In particular, we observe that equivariant architectures, which leverage task symmetry, scale better than non-equivariant models. Moreover, among equivariant architectures, higher-order representations translate to better scaling exponents. Our analysis also suggests that for compute-optimal training, the data and model sizes should scale in tandem regardless of the architecture. At a high level, these results suggest that, contrary to common belief, we should not leave it to the model to discover fundamental inductive biases such as symmetry, especially as we scale, because they change the inherent difficulty of the task and its scaling laws.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study of scaling laws for neural force fields on the task of learning interatomic potentials. It reports power-law scaling of performance with respect to data volume, parameter count, and compute, with exponents that vary by architecture. Equivariant models are shown to exhibit better scaling than non-equivariant baselines, and higher-order equivariant representations further improve the exponents. The authors conclude that symmetries should be explicitly encoded rather than discovered by the model, because they alter task difficulty and scaling behavior, and that data and model size should be scaled together for compute-optimal training.
Significance. If the central empirical findings hold, the work would be significant for geometric deep learning and physics-informed ML. It supplies quantitative evidence that equivariance improves scaling exponents in addition to absolute accuracy, supporting the design choice to hard-code task symmetries at large scales. The comparative evaluation across multiple architectures on the same geometric task is a concrete contribution to the literature on inductive biases and scaling laws.
major comments (2)
- [Scaling experiments and discussion of architecture-dependent exponents] The central claim that equivariant (and higher-order) architectures inherently improve scaling exponents by reducing task difficulty via symmetry (abstract and final paragraph) is load-bearing. The reported exponent differences could instead reflect pre-asymptotic behavior in which non-equivariant models are still expending capacity to discover symmetries within the tested data/parameter/compute windows. The manuscript should include an analysis or additional runs that test whether the observed exponent gaps persist when the scale range is extended (e.g., larger models or datasets) or when the fitting window is shifted.
- [Compute-optimal training analysis] The assertion that 'for compute-optimal training, the data and model sizes should scale in tandem regardless of the architecture' requires explicit support from the compute-optimal frontier analysis. If this conclusion rests on a single set of isoFLOPs curves, the paper should clarify how the optimal data-to-parameter ratio was determined and whether it is robust to the choice of loss metric or validation set.
minor comments (2)
- [Methods / Experimental setup] The manuscript would benefit from a clearer description of the datasets (size, diversity, train/validation/test splits) and the precise procedure used to fit the power-law exponents, including any regularization or range selection criteria.
- [Figures and results] Scaling plots should report the fitted exponents with uncertainty estimates (e.g., bootstrap or fit residuals) and indicate the exact data range over which each power law was fitted.
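The uncertainty estimates requested in the last comment could be produced in several ways. The sketch below uses a case-resampling bootstrap over the (scale, loss) points and reports a percentile interval for the exponent. The helper names, the noise level, and the synthetic data are assumptions for illustration; the manuscript's own fitting procedure is not described in this review.

```python
import numpy as np

def fit_exponent(x, loss):
    """Exponent alpha of loss ~= a * x**(-alpha), fitted in log-log space."""
    slope, _ = np.polyfit(np.log(x), np.log(loss), 1)
    return -slope

def bootstrap_exponent(x, loss, n_boot=1000, seed=0):
    """Point estimate and 95% percentile interval from resampling the points."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    loss = np.asarray(loss, dtype=float)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = rng.integers(0, n, size=n)
        if np.unique(x[idx]).size < 2:
            continue  # a line fit needs at least two distinct x values
        estimates.append(fit_exponent(x[idx], loss[idx]))
    ci_low, ci_high = np.percentile(estimates, [2.5, 97.5])
    return fit_exponent(x, loss), (ci_low, ci_high)

# Synthetic noisy measurements (illustrative, true exponent 0.3).
rng = np.random.default_rng(1)
sizes = np.logspace(3, 6, 12)
loss = 2.0 * sizes ** -0.3 * np.exp(rng.normal(0.0, 0.05, size=sizes.size))

alpha_hat, (ci_low, ci_high) = bootstrap_exponent(sizes, loss, n_boot=500)
print(f"alpha = {alpha_hat:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With only a handful of points per curve, as is common in scaling studies, the interval can be wide, so overlapping intervals between two architectures would weaken a claim that their exponents differ.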
Simulated Author's Rebuttal
We are grateful to the referee for their constructive comments, which have helped us improve the clarity and robustness of our analysis. Below, we address each major comment in detail.
Referee: [Scaling experiments and discussion of architecture-dependent exponents] The central claim that equivariant (and higher-order) architectures inherently improve scaling exponents by reducing task difficulty via symmetry (abstract and final paragraph) is load-bearing. The reported exponent differences could instead reflect pre-asymptotic behavior in which non-equivariant models are still expending capacity to discover symmetries within the tested data/parameter/compute windows. The manuscript should include an analysis or additional runs that test whether the observed exponent gaps persist when the scale range is extended (e.g., larger models or datasets) or when the fitting window is shifted.
Authors: We appreciate this concern regarding the possibility of pre-asymptotic effects. Our experiments span a wide range of scales, covering several orders of magnitude in both data volume and model parameters, which is typical for scaling law studies in this domain. The exponent differences between architectures are consistent across the fitted ranges. In the revised manuscript, we have added an analysis examining the scaling behavior in different sub-ranges of the data and parameter space to assess stability of the exponents. We agree that extending to substantially larger scales would provide further confirmation, but such experiments are computationally intensive and beyond the scope of the current work given available resources. We have updated the discussion to acknowledge this limitation explicitly.
Revision: partial
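A sub-range stability check of the kind discussed in this exchange might look like the following sketch: refit the exponent on sliding windows of consecutive scale points and see whether it drifts. The function `windowed_exponents` and both synthetic curves are hypothetical. The "mixed" curve is a sum of two power laws, chosen only because it produces a drifting local exponent; it is not a model of any architecture in the paper.

```python
import numpy as np

def windowed_exponents(x, loss, window=3):
    """Local exponent on each run of `window` consecutive points (x ascending)."""
    log_x = np.log(np.asarray(x, dtype=float))
    log_loss = np.log(np.asarray(loss, dtype=float))
    exponents = []
    for start in range(len(log_x) - window + 1):
        sl = slice(start, start + window)
        slope, _ = np.polyfit(log_x[sl], log_loss[sl], 1)
        exponents.append(-slope)
    return np.array(exponents)

# Synthetic curves for illustration only.
x = np.logspace(3, 7, 9)
pure = 3.0 * x ** -0.35                     # single power law: stable exponent
mixed = 3.0 * x ** -0.35 + 0.5 * x ** -0.1  # two regimes: exponent drifts

stable = windowed_exponents(x, pure)
drifting = windowed_exponents(x, mixed)
print("pure :", np.round(stable, 3))
print("mixed:", np.round(drifting, 3))
```

A flat sequence of local exponents supports a single power law over the tested range. A monotone trend suggests the fit window is still pre-asymptotic, and it says nothing about behavior beyond the largest scale measured.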
Referee: [Compute-optimal training analysis] The assertion that 'for compute-optimal training, the data and model sizes should scale in tandem regardless of the architecture' requires explicit support from the compute-optimal frontier analysis. If this conclusion rests on a single set of isoFLOPs curves, the paper should clarify how the optimal data-to-parameter ratio was determined and whether it is robust to the choice of loss metric or validation set.
Authors: We thank the referee for pointing out the need for more detail on this analysis. In the revised manuscript, we have expanded the relevant section to describe the procedure for determining the compute-optimal frontier: we generated isoFLOPs curves by varying data and model sizes while keeping compute fixed, then identified the optimal data-to-parameter ratio as the one that achieves the lowest validation error for a given compute budget. We performed this across multiple architectures and confirmed the tandem scaling. To address robustness, we repeated the analysis using different validation sets (e.g., held-out molecules) and observed consistent results. Regarding the loss metric, our primary metric is the mean squared error on forces, which is the standard for interatomic potential learning; we briefly note that using energy error yields qualitatively similar trends.
Revision: yes
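The isoFLOPs procedure described in this response can be sketched as follows, under loudly stated assumptions. Compute is approximated as C ≈ k · N · D with k = 6, a convention borrowed from language-model scaling studies that may not hold for force fields. The loss surface is a synthetic additive power law with equal exponents in N and D. None of these choices come from the paper.

```python
import numpy as np

def synthetic_loss(n_params, n_samples, alpha=0.3, beta=0.3):
    """Hypothetical loss surface: additive power laws in model and data size."""
    return 10.0 * n_params ** -alpha + 10.0 * n_samples ** -beta

def isoflop_optimum(compute, n_grid, loss_fn, flops_per_param_sample=6.0):
    """Along one fixed-compute curve, return the (N, D) pair with lowest loss."""
    n_samples = compute / (flops_per_param_sample * n_grid)  # D implied by C and N
    losses = loss_fn(n_grid, n_samples)
    best = np.argmin(losses)
    return n_grid[best], n_samples[best]

budgets = np.logspace(12, 16, 5)   # assumed compute budgets in FLOPs
n_grid = np.logspace(3, 9, 601)    # candidate model sizes
optima = np.array([isoflop_optimum(c, n_grid, synthetic_loss) for c in budgets])

# How the optimal model and data sizes grow with compute: N_opt ~ C**n_exp, etc.
n_exp, _ = np.polyfit(np.log(budgets), np.log(optima[:, 0]), 1)
d_exp, _ = np.polyfit(np.log(budgets), np.log(optima[:, 1]), 1)
print(f"N_opt ~ C^{n_exp:.2f}, D_opt ~ C^{d_exp:.2f}")
```

In this sketch, "scaling in tandem" corresponds to the two fitted exponents being roughly equal, so that the optimal data-to-parameter ratio stays approximately constant as compute grows. On this synthetic surface that follows from setting `alpha` equal to `beta`; whether real validation curves behave the same way is the empirical question the referee raises.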
Circularity Check
No circularity: empirical scaling comparisons are self-contained
The paper reports direct experimental measurements of power-law scaling exponents for data, parameters, and compute across equivariant and non-equivariant neural force field architectures on interatomic potential tasks. These exponents are obtained by fitting observed training curves rather than being derived from any first-principles equations or self-referential definitions within the work. No load-bearing step reduces to a fitted parameter renamed as a prediction, a self-citation chain, or an ansatz smuggled via prior work; the central claim that symmetry alters scaling behavior rests on the comparative empirical results themselves, which remain falsifiable against external benchmarks and do not presuppose the target conclusion.
Axiom & Free-Parameter Ledger
free parameters (1)
- architecture-dependent scaling exponents
axioms (1)
- domain assumption: Power-law relationships govern performance improvement with scale in this domain