
arXiv: 2605.16020 · v1 · pith:ZBPMCB43new · submitted 2026-05-15 · 💻 cs.LG · cond-mat.dis-nn · hep-lat

Variational Autoregressive Networks with probability priors

Pith reviewed 2026-05-20 21:10 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · hep-lat
keywords variational autoregressive networks · physics-informed priors · Ising model · Edwards-Anderson spin glass · Monte Carlo sampling · critical slowing down · discrete spin systems

The pith

Physics-informed priors in variational autoregressive networks reduce training burden for Ising and Edwards-Anderson spin models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that variational autoregressive networks for sampling Boltzmann distributions perform better when initialized with a prior probability distribution drawn from physical symmetries and spin-spin interactions rather than starting from a blank slate. This approach targets the problem of critical slowing down in Monte Carlo simulations near phase transitions by cutting the effort needed to learn underlying physics from data alone. Results on the Ising model and the Edwards-Anderson spin glass indicate that the networks reach accurate approximations more readily and can handle larger lattices.

Core claim

Building on strategies that embed spin-spin interactions, the authors construct a prior probability distribution from physical symmetries to serve as the starting point for training variational autoregressive networks; numerical tests on the Ising model and Edwards-Anderson spin glass demonstrate that this physics-informed initialization lowers training cost and supports simulations of larger discrete spin systems compared with generic architectures.

What carries the argument

A prior probability distribution constructed from physical symmetries and spin-spin interactions that initializes the variational autoregressive network before training on the target Boltzmann distribution.
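
As a toy illustration of what "starting from a prior" can mean, the sketch below samples an open one-dimensional Ising chain autoregressively. It is not the paper's construction: the chain geometry, the helper names (`prior_logit`, `sample_chain`, `zero_correction`) and the choice to add a trainable correction to the prior's log-odds are all assumptions made for illustration. For this particular toy model the nearest-neighbour prior is already the exact conditional, so a correction that starts at zero reproduces the Boltzmann distribution before any training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prior_logit(prev_spin, beta, J=1.0):
    # Log-odds of s_i = +1 given the left neighbour, for an open chain
    # with energy E = -J * sum_i s_i * s_{i+1} (hypothetical toy prior).
    return 2.0 * beta * J * prev_spin

def sample_chain(n_spins, beta, correction, rng):
    """Draw one configuration spin by spin and return it with log q(s)."""
    spins = np.empty(n_spins, dtype=int)
    log_q = 0.0
    prev = 0  # the first spin has no left neighbour, so its logit is 0
    for i in range(n_spins):
        # Prior log-odds plus a (trainable, here hypothetical) correction.
        logit = prior_logit(prev, beta) + correction(i, spins[:i])
        p_up = sigmoid(logit)
        s = 1 if rng.random() < p_up else -1
        log_q += np.log(p_up if s == 1 else 1.0 - p_up)
        spins[i] = s
        prev = s
    return spins, log_q

# A correction that is identically zero: the sampler equals the prior.
zero_correction = lambda i, history: 0.0

rng = np.random.default_rng(0)
spins, log_q = sample_chain(8, beta=0.5, correction=zero_correction, rng=rng)
```

In two dimensions the exact conditionals are no longer available in closed form, which is where an approximate prior plus a learned correction would become relevant.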

If this is right

  • Training cost drops for both the ferromagnetic Ising model and the frustrated Edwards-Anderson spin glass.
  • Larger lattice sizes become accessible within the same computational budget.
  • The method still produces samples from the correct Boltzmann distribution once training completes.
  • The improvement holds across different temperatures, including near critical points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-construction strategy could be tested on other lattice models whose symmetries are known but whose partition functions are hard to compute.
  • If the prior can be made temperature-dependent, the network might require even less retraining when scanning across phases.
  • Hybrid schemes that combine the prior with standard Monte Carlo updates could further suppress autocorrelation times.

Load-bearing premise

A prior built from physical symmetries and interactions can be chosen so that it speeds convergence without adding biases that block accurate approximation of the target distribution.

What would settle it

Training runs on the same lattices that show no reduction in epochs or steps to reach target accuracy when the physics prior is used instead of a uniform starting distribution.

Figures

Figures reproduced from arXiv: 2605.16020 by Dawid Zapolski, Piotr Białas, Piotr Korcyl, Tomasz Stebel.

Figure 1: Calculating p(s_27|s_<27). Solid black lines indicate the spins that are already fixed. The dashed lines denote the spins that are not yet fixed and should be summed over to obtain the true conditional probability p_27(s_<27). The dotted lines indicate padding spins that are required for efficient t^4-order calculations. The inner contour represents the convolution kernel for the t^4-order approximation. Othe…
Figure 2: Magnetization distribution for the values of …
Figure 3: History of the training for the Ising model on …
Figure 4: Various estimates of F for the Ising model at critical β (left) and β = 0.5 (right). Uncertainties are much smaller than the point size.
… introducing a parameter [24] w̄ ≡ Z_nis / Z_mc = e^(F_mc − F_nis) (47). The values of this parameter can be found in Table B.4 in Appendix B. The various estimators of F for β = β_c are compared in …
Figure 5: History of the training for the Ising model at …
Figure 6: Training history for the Edwards-Anderson model on …
Figure 7: Different estimates of F for the Edwards-Anderson model at β = 0.6 (left) and β = 0.9 (right).
[Plot residue: two panels, ESS versus era and m versus era, with era running from 0 to 10000 and curves labelled t0, t1, t2, t3, t4.]
Figure 8: Results for the Edwards-Anderson model on …
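
The captions above refer to estimates of F, to an effective sample size (ESS), and to a ratio Z_nis / Z_mc. The sketch below shows one common way such quantities are computed from samples of a neural sampler, assuming each sample comes with its log-probability log q(s) and its energy E(s). The function names and the two-spin example are hypothetical, the ESS normalisation (a value between 0 and 1) follows the usual Kong-style definition, and converting log Z into F depends on the paper's normalisation convention, which is not visible in the captions.

```python
import numpy as np

def log_mean_exp(x):
    # Numerically stable log of the sample mean of exp(x).
    m = np.max(x)
    return m + np.log(np.mean(np.exp(x - m)))

def nis_diagnostics(log_q, energy, beta):
    """Importance-sampling estimate of log Z and normalised ESS.

    log_q  : log-probabilities assigned by the sampler to its own samples
    energy : energies E(s) of the same samples
    """
    log_w = -beta * energy - log_q      # unnormalised importance log-weights
    log_z = log_mean_exp(log_w)         # log of (1/N) * sum_k w_k
    w = np.exp(log_w - np.max(log_w))   # rescaled weights; ESS is scale-invariant
    ess = w.sum() ** 2 / (len(w) * np.sum(w ** 2))
    return log_z, ess

# Hypothetical check on two spins with E = -s1*s2, listing each of the
# four configurations once and pretending they came from a uniform sampler.
beta = 0.5
energy = np.array([-1.0, 1.0, 1.0, -1.0])   # (++), (+-), (-+), (--)
log_q = np.full(4, np.log(0.25))
log_z, ess = nis_diagnostics(log_q, energy, beta)
```

An ESS close to 1 indicates that the sampler's distribution is close to the target; values near 0 indicate that a few samples dominate the weights.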
Original abstract

Monte Carlo methods are essential across diverse scientific fields, yet their efficiency is frequently hampered by critical slowing down, a sharp increase in autocorrelation times near phase transitions. Although deep learning approaches, such as neural-network-based samplers, have been proposed to alleviate this issue, they face another serious problem: the difficulty of training the models. This difficulty partially stems from the overly general nature of original machine-learning architectures, which often ignore underlying physical symmetries and force networks to relearn them from scratch. In this paper, we demonstrate that incorporating physical priors into the model significantly enhances performance. Building upon existing strategies that integrate spin-spin interactions, we propose a framework that utilizes a prior probability distribution as a starting point for training. Our results for the Ising model, as well as for the Edwards-Anderson spin glass model, suggest that moving away from 'blank slate' models in favor of physics-informed priors reduces the training burden and facilitates the simulation of larger system sizes in discrete spin models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes augmenting variational autoregressive networks (VANs) with a physics-informed prior probability distribution derived from spin-spin couplings and symmetries. This prior is used as a starting point for training rather than a blank-slate initialization. The central claim is that the approach reduces training burden and enables sampling of larger system sizes for the Ising model and the Edwards-Anderson spin-glass model by incorporating domain knowledge instead of forcing the network to rediscover physical structure from scratch.

Significance. If the quantitative results hold and the final distribution remains unbiased with respect to the target Boltzmann measure, the work would provide a practical route to more efficient neural-network Monte Carlo samplers for discrete spin systems. It directly addresses the training difficulty that currently limits the scalability of autoregressive models near critical points.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (method): the claim that the prior 'serves as an effective starting point without introducing biases' is load-bearing for the headline result, yet the manuscript supplies no explicit demonstration that the converged variational distribution recovers the same observables (energy, magnetization, specific heat) to the same precision as an un-prioritized VAN after sufficient training steps. A direct comparison of fixed-point residuals or KL divergence to the target measure is required.
  2. [§4] §4 (results): no numerical tables or figures report training curves, autocorrelation times, or error bars for the Ising and Edwards-Anderson cases. Without these metrics it is impossible to judge whether the reported facilitation of larger system sizes is genuine acceleration or an artifact of a modified stationary point.
  3. [§2.2] §2.2 (prior construction): the precise functional form by which the prior enters the loss (additive term, multiplicative reweighting, or initialization only) must be stated with an equation; if the objective is altered, the variational bound or fixed-point equation changes and the unbiasedness argument must be re-derived.
minor comments (2)
  1. [Abstract and §2] Notation for the prior distribution p_prior should be introduced once and used consistently; currently the abstract and method section employ slightly different phrasing.
  2. [Figures] Figure captions should explicitly state the system sizes, inverse temperatures, and number of independent runs used to generate each panel.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which have helped clarify several aspects of our work. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and additional results.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method): the claim that the prior 'serves as an effective starting point without introducing biases' is load-bearing for the headline result, yet the manuscript supplies no explicit demonstration that the converged variational distribution recovers the same observables (energy, magnetization, specific heat) to the same precision as an un-prioritized VAN after sufficient training steps. A direct comparison of fixed-point residuals or KL divergence to the target measure is required.

    Authors: We agree that an explicit verification strengthens the unbiasedness claim. In the revised manuscript we add a direct comparison in §4: after full training, the prior-initialized and standard VANs yield energy, magnetization, and specific heat values that agree within statistical errors for both the Ising and Edwards-Anderson models. We also include a supplementary figure showing the KL divergence to the target Boltzmann measure versus training steps for both initializations, confirming convergence to statistically indistinguishable fixed points. revision: yes

  2. Referee: [§4] §4 (results): no numerical tables or figures report training curves, autocorrelation times, or error bars for the Ising and Edwards-Anderson cases. Without these metrics it is impossible to judge whether the reported facilitation of larger system sizes is genuine acceleration or an artifact of a modified stationary point.

    Authors: We accept that quantitative training diagnostics are necessary. The revised §4 now contains training curves for the variational loss, measured autocorrelation times of the generated samples, and error bars on all observables for both models. These data show that the prior initialization accelerates convergence without shifting the stationary distribution, thereby supporting the claim of genuine scalability improvement. revision: yes

  3. Referee: [§2.2] §2.2 (prior construction): the precise functional form by which the prior enters the loss (additive term, multiplicative reweighting, or initialization only) must be stated with an equation; if the objective is altered, the variational bound or fixed-point equation changes and the unbiasedness argument must be re-derived.

    Authors: The prior enters solely as an initialization of the network parameters and does not modify the loss. We have inserted the following explicit statement and equation in §2.2: the initial parameters are set to θ₀ = arg min_θ KL(q_θ || p_prior), so that the network initially reproduces the prior, after which standard variational training proceeds with the unmodified objective L(θ) = KL(q_θ || p_Boltzmann). Because the training objective remains the standard variational bound, the fixed-point equation and unbiasedness argument are unchanged from the original VAN framework. revision: yes
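
For readers less familiar with the objective L(θ) = KL(q_θ || p_Boltzmann) quoted in the response above, the following short derivation shows why minimising it is equivalent to minimising a variational free energy. It assumes the standard conventions p_Boltzmann(s) = e^{−βE(s)}/Z and F = −(1/β) ln Z; the symbol F_q is introduced here for illustration and may differ from the paper's notation.

```latex
\begin{aligned}
\mathrm{KL}(q_\theta \,\|\, p_{\mathrm{Boltzmann}})
  &= \sum_{s} q_\theta(s)\,\bigl[\ln q_\theta(s) - \ln p_{\mathrm{Boltzmann}}(s)\bigr] \\
  &= \sum_{s} q_\theta(s)\,\bigl[\ln q_\theta(s) + \beta E(s)\bigr] + \ln Z \\
  &= \beta\,\bigl(F_q - F\bigr) \;\ge\; 0,
\end{aligned}
\qquad
F_q \equiv \frac{1}{\beta}\sum_{s} q_\theta(s)\,\bigl[\beta E(s) + \ln q_\theta(s)\bigr],
\quad
F \equiv -\frac{1}{\beta}\ln Z .
```

Since ln Z does not depend on θ, the gradient of the KL divergence equals β times the gradient of F_q, and F_q can be estimated from samples of q_θ without knowing Z.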

Circularity Check

0 steps flagged

No circularity: priors are external physical inputs, results are empirical

Full rationale

The paper proposes augmenting variational autoregressive networks by initializing or biasing with a prior probability distribution constructed from spin-spin interactions and physical symmetries. This prior is an independent input derived from the target model's Hamiltonian, not defined in terms of the network's output or fitted parameters. Training proceeds by standard variational optimization toward the Boltzmann distribution; the reported gains in convergence speed and accessible system size are presented as empirical observations on Ising and Edwards-Anderson instances. No equation equates a derived quantity to the prior by construction, no 'prediction' is statistically forced from a fitted subset, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim therefore remains an independent methodological proposal rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that physical priors can be effectively encoded and that they reduce training burden without compromising sampling accuracy; this is a domain assumption drawn from statistical mechanics rather than a derived result.

axioms (1)
  • domain assumption Physical symmetries and spin-spin interactions can be encoded into a prior probability distribution that serves as an effective initialization for the variational autoregressive network.
    Invoked in the abstract when stating that incorporating physical priors enhances performance over blank-slate models.

pith-pipeline@v0.9.0 · 5708 in / 1222 out tokens · 32799 ms · 2026-05-20T21:10:29.958938+00:00 · methodology



Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

  1. [1] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics 21 (1953) 1087–1092. doi:10.1063/1.1699114; W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1970) 97–109. doi:10.1093/biomet/57.1.97

  2. [2] D. Wu, L. Wang, P. Zhang, Solving statistical mechanics using variational autoregressive networks, Phys. Rev. Lett. 122 (2019) 080602

  3. [3] M. S. Albergo, G. Kanwar, P. E. Shanahan, Flow-based generative models for Markov chain Monte Carlo in lattice field theory, Phys. Rev. D 100 (2019) 034515. doi:10.1103/PhysRevD.100.034515

  4. [4] P. Białas, P. Korcyl, T. Stebel, Analysis of autocorrelation times in neural Markov chain Monte Carlo simulations, Phys. Rev. E 107 (2023) 015303. doi:10.1103/PhysRevE.107.015303. arXiv:2111.10189

  5. [5] I. Biazzo, The autoregressive neural network architecture of the Boltzmann distribution of pairwise interacting spins systems, Communications Physics 6 (2023) 1–10. doi:10.1038/S42005-023-01416-5

  6. [6] I. Biazzo, D. Wu, G. Carleo, Sparse autoregressive neural networks for classical spin systems, Machine Learning: Science and Technology 5 (2024) 025074. doi:10.1088/2632-2153/ad5783

  7. [7] P. Białas, P. Korcyl, T. Stebel, Hierarchical autoregressive neural networks for statistical systems, Comput. Phys. Commun. 281 (2022) 108502. doi:10.1016/j.cpc.2022.108502. arXiv:2203.10989

  8. [8] A. Singha, E. Cellini, K. A. Nicoli, K. Jansen, S. Kühn, S. Nakajima, Multilevel generative samplers for investigating critical phenomena, 2025. arXiv:2503.08918

  9. [9] L. M. Del Bono, F. Ricci-Tersenghi, F. Zamponi, Nearest-neighbors neural network architecture for efficient sampling of statistical physics models, Machine Learning: Science and Technology 6 (2025) 025029. doi:10.1088/2632-2153/adcdc1. arXiv:2407.19483

  10. [10] S. F. Edwards, P. W. Anderson, Theory of spin glasses, Journal of Physics F: Metal Physics 5 (1975) 965. doi:10.1088/0305-4608/5/5/017

  11. [11] D. Blessing, X. Jia, J. Esslinger, F. Vargas, G. Neumann, Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling, arXiv e-prints (2024). arXiv:2406.07423

  12. [12] M. Germain, K. Gregor, I. Murray, H. Larochelle, MADE: Masked autoencoder for distribution estimation, 2015. arXiv:1502.03509

  13. [13] K. A. Nicoli, S. Nakajima, N. Strodthoff, W. Samek, K.-R. Müller, P. Kessel, Asymptotically unbiased estimation of physical observables with neural samplers, Phys. Rev. E 101 (2020) 023304.

  14. [14] P. Białas, P. Korcyl, T. Stebel, D. Zapolski, Rényi entanglement entropy of a spin chain with generative neural networks, Phys. Rev. E 110 (2024) 044116. doi:10.1103/PhysRevE.110.044116

  15. [15] P. Białas, P. Korcyl, T. Stebel, Mutual information of spin systems from autoregressive neural networks, Phys. Rev. E 108 (2023) 044140. doi:10.1103/PhysRevE.108.044140

  16. [16] A. Bulgarelli, E. Cellini, K. Jansen, S. Kühn, A. Nada, S. Nakajima, K. A. Nicoli, M. Panero, Flow-based sampling for entanglement entropy and the machine learning of defects, Phys. Rev. Lett. 134 (2025) 151601. doi:10.1103/PhysRevLett.134.151601

  17. [17] P. Białas, P. Korcyl, T. Stebel, D. Zapolski, Estimation of the reduced density matrix and entanglement entropies using autoregressive networks, 2025. arXiv:2506.04170

  18. [18] S.-H. Li, L. Wang, Neural network renormalization group, Phys. Rev. Lett. 121 (2018) 260601. doi:10.1103/PhysRevLett.121.260601

  19. [19] Simulating first-order phase transition with hierarchical autoregressive networks, Phys. Rev. E 107 (2023) 054127. doi:10.1103/PhysRevE.107.054127. arXiv:2212.04955

  20. [20] A. E. Ferdinand, M. E. Fisher, Bounded and inhomogeneous Ising models. I. Specific-heat anomaly of a finite lattice, Phys. Rev. 185 (1969) 832–846. doi:10.1103/PhysRev.185.832

  21. [21] A. Kong, A note on importance sampling using standardized weights, University of Chicago Technical Reports (1992)

  22. [22] J. S. Liu, Metropolized independent sampling with comparisons to rejection sampling and importance sampling, Statistics and Computing 6 (1996) 113–119. doi:10.1007/BF00162521/METRICS

  23. [23] U. Wolff, Collective Monte Carlo updating for spin systems, Phys. Rev. Lett. 62 (1989) 361–364. doi:10.1103/PhysRevLett.62.361

  24. [24] K. A. Nicoli, C. J. Anders, T. Hartung, K. Jansen, P. Kessel, S. Nakajima, Detecting and mitigating mode-collapse for flow-based sampling of lattice field theories, Phys. Rev. D 108 (2023) 114501. doi:10.1103/PhysRevD.108.114501

  25. [25] G. Parisi, Order parameter for spin-glasses, Physical Review Letters 50 (1983) 1946. doi:10.1103/PhysRevLett.50.1946

  26. [26] I. Morgenstern, K. Binder, Magnetic correlations in two-dimensional spin-glasses, Phys. Rev. B 22 (1980) 288–303. doi:10.1103/PhysRevB.22.288.

    [Diagram residue: network schematic with inputs s0 to s3, two layers of hidden units h0 and h1 with LReLU between them, added terms l0 to l3, sigmoids σ, and outputs q0 to q3.]

  27. [27] K. Hukushima, K. Nemoto, Exchange Monte Carlo method and application to spin glass simulations, Journal of the Physical Society of Japan 65 (1996) 1604–1608. doi:10.1143/JPSJ.65.1604

  28. [28] P. Białas, P. Korcyl, T. Stebel, A. Stefański, D. Zapolski, Sampling two-dimensional spin systems with transformers, 2026. arXiv:2604.27738.

Appendix A. Implementation

As our aim was only to provide a proof of concept for the probability priors, we used a simple architecture consisting of two dense layers with a LeakyReLU activation function in between....
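
The appendix excerpt above is truncated, so the following is only a guess at what "two dense layers with a LeakyReLU activation in between" could look like in an autoregressive setting. The MADE-style triangular masks, the hidden width equal to the number of spins, the function names, and the way the prior log-odds are added to the network output are all assumptions, not details taken from the paper. The sketch does show one property worth noting: if the second layer starts at zero, the output equals the prior log-odds exactly.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def make_masks(n):
    # Hidden unit j may see inputs i <= j; output k may see hidden units j < k.
    # Together this makes output k depend only on spins with index < k.
    mask_in = np.tril(np.ones((n, n)))
    mask_out = np.tril(np.ones((n, n)), k=-1)
    return mask_in, mask_out

def conditional_logits(spins, params, prior_logits):
    """Log-odds of s_k = +1 for every site k, given the preceding spins."""
    w0, b0, w1, b1 = params
    mask_in, mask_out = make_masks(len(spins))
    hidden = leaky_relu((w0 * mask_in) @ spins + b0)   # first dense layer
    correction = (w1 * mask_out) @ hidden + b1         # second dense layer
    return prior_logits + correction                   # prior plus learned part

# Hypothetical usage with random weights and a flat (zero) prior.
rng = np.random.default_rng(0)
n = 6
params = (rng.normal(size=(n, n)), np.zeros(n),
          rng.normal(size=(n, n)), np.zeros(n))
spins = rng.choice([-1.0, 1.0], size=n)
logits = conditional_logits(spins, params, prior_logits=np.zeros(n))
```

In a real setting the prior log-odds would themselves depend on the preceding spins; they are passed in as a fixed array here only to keep the example short.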