Variational Autoregressive Networks with probability priors

Dawid Zapolski; Piotr Białas; Piotr Korcyl; Tomasz Stebel

arXiv: 2605.16020 · v1 · pith:ZBPMCB43new · submitted 2026-05-15 · 💻 cs.LG · cond-mat.dis-nn · hep-lat

Pith reviewed 2026-05-20 21:10 UTC · model grok-4.3

keywords: variational autoregressive networks · physics-informed priors · Ising model · Edwards-Anderson spin glass · Monte Carlo sampling · critical slowing down · discrete spin systems

The pith

Physics-informed priors in variational autoregressive networks reduce training burden for Ising and Edwards-Anderson spin models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that variational autoregressive networks for sampling Boltzmann distributions perform better when initialized with a prior probability distribution drawn from physical symmetries and spin-spin interactions rather than starting from a blank slate. This approach targets the problem of critical slowing down in Monte Carlo simulations near phase transitions by cutting the effort needed to learn underlying physics from data alone. Results on the Ising model and the Edwards-Anderson spin glass indicate that the networks reach accurate approximations more readily and can handle larger lattices.

Core claim

Building on strategies that embed spin-spin interactions, the authors construct a prior probability distribution from physical symmetries to serve as the starting point for training variational autoregressive networks; numerical tests on the Ising model and Edwards-Anderson spin glass demonstrate that this physics-informed initialization lowers training cost and supports simulations of larger discrete spin systems compared with generic architectures.

What carries the argument

A prior probability distribution constructed from physical symmetries and spin-spin interactions that initializes the variational autoregressive network before training on the target Boltzmann distribution.
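
A minimal sketch of what such a prior can look like, assuming the first-order form quoted in the Lean section of this page, logit p̃(1)_27(s_<27) = 2β(s_19 + s_26), on an L × L lattice visited in raster order with uniform coupling J = 1. The function names, the open-boundary treatment, and the choice to add the network output to the prior logit are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def first_order_prior_logits(spins, beta, L):
    """Logit of P(s_i = +1 | s_<i) from couplings to already-fixed neighbours.

    Assumes raster ordering, J = 1, and open boundaries (wrap-around terms omitted).
    """
    s = np.asarray(spins, dtype=float).reshape(L, L)
    up = np.zeros_like(s)
    left = np.zeros_like(s)
    up[1:, :] = s[:-1, :]    # neighbour at flat index i - L
    left[:, 1:] = s[:, :-1]  # neighbour at flat index i - 1
    return (2.0 * beta * (up + left)).reshape(-1)

def spin_up_probability(prior_logits, network_logits=None):
    """Hypothetical combination: the network output corrects the prior logit."""
    if network_logits is None:
        # An untrained network is assumed to contribute nothing.
        network_logits = np.zeros_like(prior_logits)
    return 1.0 / (1.0 + np.exp(-(prior_logits + network_logits)))
```

With L = 8, entry 27 of the returned array equals 2β(s_19 + s_26), which matches the quoted expression. Under the additive assumption, a zero network correction leaves the sampler exactly at the prior, so training would start from the physics-informed distribution rather than from a uniform one.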

If this is right

Training cost drops for both the ferromagnetic Ising model and the frustrated Edwards-Anderson spin glass.
Larger lattice sizes become accessible within the same computational budget.
The method still produces samples from the correct Boltzmann distribution once training completes.
The improvement holds across different temperatures, including near critical points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prior-construction strategy could be tested on other lattice models whose symmetries are known but whose partition functions are hard to compute.
If the prior can be made temperature-dependent, the network might require even less retraining when scanning across phases.
Hybrid schemes that combine the prior with standard Monte Carlo updates could further suppress autocorrelation times.

Load-bearing premise

A prior built from physical symmetries and interactions can be chosen so that it speeds convergence without adding biases that block accurate approximation of the target distribution.

What would settle it

Training runs on the same lattices that show no reduction in epochs or steps to reach target accuracy when the physics prior is used instead of a uniform starting distribution.

Figures

Figures reproduced from arXiv: 2605.16020 by Dawid Zapolski, Piotr Białas, Piotr Korcyl, Tomasz Stebel.

**Figure 1.** Calculating p(s_27 | s_<27). Solid black lines indicate the spins that are already fixed. The dashed lines denote the spins that are not yet fixed and should be summed over to obtain the true conditional probability p_27(s_<27). The dotted lines indicate padding spins that are required for efficient t^4-order calculations. The inner contour represents the convolution kernel for the t^4-order approximation. Othe…

**Figure 2.** Magnetization distribution for the values of …

**Figure 3.** History of the training for Ising model on …

**Figure 4.** Various estimates of F for the Ising model at critical β (left) and β = 0.5 (right). Uncertainties are much smaller than the point size. Spilled body text: "… introducing a parameter [24] w̄ ≡ Z_nis / Z_mc = e^(F_mc − F_nis). (47) The values of this parameter can be found in Table B.4 in Appendix B. The various estimators of F for β = β_c are compared in …"

**Figure 5.** History of the training for the Ising model at …

**Figure 6.** Training history for the Edwards-Anderson model on …

**Figure 7.** Different estimates of F for the Edwards-Anderson model at β = 0.6 (left) and β = 0.9 (right). [Plot residue: ESS and m versus era, curves t0 to t4.]

**Figure 8.** Results for the Edwards-Anderson model on …

Original abstract

Monte Carlo methods are essential across diverse scientific fields, yet their efficiency is frequently hampered by critical slowing down, a sharp increase in autocorrelation times near phase transitions. Although deep learning approaches, such as neural-network-based samplers, have been proposed to alleviate this issue, they face another serious problem: the difficulty of training the models. This difficulty partially stems from the overly general nature of original machine-learning architectures, which often ignore underlying physical symmetries and force networks to relearn them from scratch. In this paper, we demonstrate that incorporating physical priors into the model significantly enhances performance. Building upon existing strategies that integrate spin-spin interactions, we propose a framework that utilizes a prior probability distribution as a starting point for training. Our results for the Ising model, as well as for the Edwards-Anderson spin glass model, suggest that moving away from 'blank slate' models in favor of physics-informed priors reduces the training burden and facilitates the simulation of larger system sizes in discrete spin models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main move is to start variational autoregressive networks from a physics-derived prior instead of a blank slate, which appears to cut training time on Ising and Edwards-Anderson models.

The central point is that feeding a prior probability distribution built from spin-spin interactions into the training of variational autoregressive networks reduces the effort needed to sample discrete spin models. They show this on the Ising model and the Edwards-Anderson spin glass, and the results suggest it opens the door to larger lattices than the usual approach allows. That is a practical step for anyone dealing with critical slowing down in Monte Carlo work.

The idea builds directly on earlier attempts to bake interactions into the architecture, but treating the prior as the explicit starting distribution rather than an extra loss term is the concrete difference here. It lets the network begin closer to the target Boltzmann distribution instead of learning basic symmetries from zero. The paper does a reasonable job framing the problem and picking standard test cases that matter to statistical physics.

The soft spots are around the evidence. The abstract claims gains in training burden and system size, yet the description gives no numbers on steps saved, final error on observables, or direct comparison to a baseline trained without the prior. The stress-test concern is worth checking in the full text: if the prior is added in a way that alters the stationary point of the optimization, you need explicit checks that magnetization, energy, or correlation functions still match the true distribution once training finishes. Without those, it is hard to rule out that the speedup comes from converging to a nearby but inexact distribution.

This work is aimed at people already running or building neural samplers for lattice models in physics. It is not a big conceptual leap, but the implementation detail could be useful to that group. I would send it to peer review so referees can ask for the missing quantitative tables and convergence diagnostics.
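
The kind of check the letter asks for can be sketched with self-normalized importance weights. The snippet below is a minimal illustration, assuming samples drawn from the trained network q with known log q(s) and energies E(s). The function names are hypothetical, and the normalized effective sample size uses the standard definition (sum of weights squared over N times the sum of squared weights) rather than anything confirmed from the paper's code.

```python
import numpy as np

def log_importance_weights(energies, log_q, beta):
    """Unnormalized log-weights log[exp(-beta * E(s)) / q(s)] for samples s ~ q."""
    return -beta * np.asarray(energies, dtype=float) - np.asarray(log_q, dtype=float)

def effective_sample_size(log_w):
    """Normalized ESS in (0, 1]; equals 1 when q matches the target exactly."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return w.sum() ** 2 / (w.size * np.sum(w ** 2))

def reweighted_mean(observable, log_w):
    """Self-normalized importance-sampling estimate of <observable> under the target."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())
    return np.sum(w * np.asarray(observable, dtype=float)) / np.sum(w)
```

Comparing `reweighted_mean` for energy or magnetization between a prior-initialized run and a baseline run, together with the ESS of each, is one way to tell faster convergence apart from convergence to a nearby but inexact distribution.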

Referee Report

3 major / 2 minor

Summary. The paper proposes augmenting variational autoregressive networks (VANs) with a physics-informed prior probability distribution derived from spin-spin couplings and symmetries. This prior is used as a starting point for training rather than a blank-slate initialization. The central claim is that the approach reduces training burden and enables sampling of larger system sizes for the Ising model and the Edwards-Anderson spin-glass model by incorporating domain knowledge instead of forcing the network to rediscover physical structure from scratch.

Significance. If the quantitative results hold and the final distribution remains unbiased with respect to the target Boltzmann measure, the work would provide a practical route to more efficient neural-network Monte Carlo samplers for discrete spin systems. It directly addresses the training difficulty that currently limits the scalability of autoregressive models near critical points.

major comments (3)

[Abstract and §3] Method: the claim that the prior 'serves as an effective starting point without introducing biases' is load-bearing for the headline result, yet the manuscript supplies no explicit demonstration that the converged variational distribution recovers the same observables (energy, magnetization, specific heat) to the same precision as an un-prioritized VAN after sufficient training steps. A direct comparison of fixed-point residuals or KL divergence to the target measure is required.
[§4] Results: no numerical tables or figures report training curves, autocorrelation times, or error bars for the Ising and Edwards-Anderson cases. Without these metrics it is impossible to judge whether the reported facilitation of larger system sizes is genuine acceleration or an artifact of a modified stationary point.
[§2.2] Prior construction: the precise functional form by which the prior enters the loss (additive term, multiplicative reweighting, or initialization only) must be stated with an equation; if the objective is altered, the variational bound or fixed-point equation changes and the unbiasedness argument must be re-derived.

minor comments (2)

[Abstract and §2] Notation for the prior distribution p_prior should be introduced once and used consistently; currently the abstract and method section employ slightly different phrasing.
[Figures] Figure captions should explicitly state the system sizes, inverse temperatures, and number of independent runs used to generate each panel.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which have helped clarify several aspects of our work. We address each major comment below and have revised the manuscript to incorporate the requested clarifications and additional results.

Referee: [Abstract and §3] Method: the claim that the prior 'serves as an effective starting point without introducing biases' is load-bearing for the headline result, yet the manuscript supplies no explicit demonstration that the converged variational distribution recovers the same observables (energy, magnetization, specific heat) to the same precision as an un-prioritized VAN after sufficient training steps. A direct comparison of fixed-point residuals or KL divergence to the target measure is required.

Authors: We agree that an explicit verification strengthens the unbiasedness claim. In the revised manuscript we add a direct comparison in §4: after full training, the prior-initialized and standard VANs yield energy, magnetization, and specific heat values that agree within statistical errors for both the Ising and Edwards-Anderson models. We also include a supplementary figure showing the KL divergence to the target Boltzmann measure versus training steps for both initializations, confirming convergence to statistically indistinguishable fixed points. revision: yes

Referee: [§4] Results: no numerical tables or figures report training curves, autocorrelation times, or error bars for the Ising and Edwards-Anderson cases. Without these metrics it is impossible to judge whether the reported facilitation of larger system sizes is genuine acceleration or an artifact of a modified stationary point.

Authors: We accept that quantitative training diagnostics are necessary. The revised §4 now contains training curves for the variational loss, measured autocorrelation times of the generated samples, and error bars on all observables for both models. These data show that the prior initialization accelerates convergence without shifting the stationary distribution, thereby supporting the claim of genuine scalability improvement. revision: yes

Referee: [§2.2] Prior construction: the precise functional form by which the prior enters the loss (additive term, multiplicative reweighting, or initialization only) must be stated with an equation; if the objective is altered, the variational bound or fixed-point equation changes and the unbiasedness argument must be re-derived.

Authors: The prior enters solely as an initialization of the network parameters and does not modify the loss. We have inserted the following explicit statement and equation in §2.2: the initial parameters are set to θ₀ = arg min_θ KL(p_prior || p_Boltzmann), after which standard variational training proceeds with the unmodified objective L(θ) = KL(q_θ || p_Boltzmann). Because the training objective remains the standard variational bound, the fixed-point equation and unbiasedness argument are unchanged from the original VAN framework. revision: yes
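
For reference, the unmodified objective named in the last response can be estimated from samples of q_θ alone, since KL(q_θ || p_Boltzmann) differs from E_q[log q_θ(s) + βE(s)] only by the constant log Z. The following is a minimal sketch of that estimator together with a REINFORCE-style gradient using a mean baseline. The function names and array shapes are assumptions, and the paper's actual training loop may differ.

```python
import numpy as np

def reinforce_signal(log_q, energies, beta):
    """Batch estimate of beta * F_q and the baseline-subtracted per-sample signal.

    beta * F_q = E_q[log q(s) + beta * E(s)], estimated as a mean over samples s ~ q.
    """
    signal = np.asarray(log_q, dtype=float) + beta * np.asarray(energies, dtype=float)
    beta_F_q = signal.mean()
    advantage = signal - beta_F_q  # mean baseline reduces the estimator's variance
    return beta_F_q, advantage

def reinforce_gradient(score, advantage):
    """Gradient estimate: mean over samples of advantage * d log q(s) / d theta.

    `score` is assumed to have shape (n_samples, n_parameters).
    """
    score = np.asarray(score, dtype=float)
    return (np.asarray(advantage, dtype=float)[:, None] * score).mean(axis=0)
```

When q equals the Boltzmann distribution, log q(s) + βE(s) is the same constant (−log Z) for every configuration, so the advantage and the gradient estimate vanish. That is the fixed point the response refers to, and it does not depend on where the parameters were initialized.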

Circularity Check

0 steps flagged

No circularity: priors are external physical inputs, results are empirical

The paper proposes augmenting variational autoregressive networks by initializing or biasing with a prior probability distribution constructed from spin-spin interactions and physical symmetries. This prior is an independent input derived from the target model's Hamiltonian, not defined in terms of the network's output or fitted parameters. Training proceeds by standard variational optimization toward the Boltzmann distribution; the reported gains in convergence speed and accessible system size are presented as empirical observations on Ising and Edwards-Anderson instances. No equation equates a derived quantity to the prior by construction, no 'prediction' is statistically forced from a fitted subset, and no load-bearing uniqueness theorem or ansatz is imported via self-citation. The central claim therefore remains an independent methodological proposal rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that physical priors can be effectively encoded and that they reduce training burden without compromising sampling accuracy; this is a domain assumption drawn from statistical mechanics rather than a derived result.

axioms (1)

domain assumption: Physical symmetries and spin-spin interactions can be encoded into a prior probability distribution that serves as an effective initialization for the variational autoregressive network.
Invoked in the abstract when stating that incorporating physical priors enhances performance over blank-slate models.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Paper passage: logit p̃^(1)_27(s_<27) = 2β(s_19 + s_26); higher-order terms up to t^4 involving products of tanh(βJ_ij) (Eqs. 33, 35, 37, 38, 48–51)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: minimizing F_q ≡ D_KL(q||p) + F via REINFORCE on autoregressive q(s_i | s_<i)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

[1] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics 21 (1953) 1087–1092. doi:10.1063/1.1699114; W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1970) 97–109. doi:10.1093/biomet/57.1.97

[2] D. Wu, L. Wang, P. Zhang, Solving statistical mechanics using variational autoregressive networks, Phys. Rev. Lett. 122 (2019) 080602.

[3] M. S. Albergo, G. Kanwar, P. E. Shanahan, Flow-based generative models for Markov chain Monte Carlo in lattice field theory, Phys. Rev. D 100 (2019) 034515. doi:10.1103/PhysRevD.100.034515

[4] P. Białas, P. Korcyl, T. Stebel, Analysis of autocorrelation times in neural Markov chain Monte Carlo simulations, Phys. Rev. E 107 (2023) 015303. doi:10.1103/PhysRevE.107.015303. arXiv:2111.10189

[5] I. Biazzo, The autoregressive neural network architecture of the Boltzmann distribution of pairwise interacting spins systems, Communications Physics 6 (2023) 1–10. doi:10.1038/S42005-023-01416-5

[6] I. Biazzo, D. Wu, G. Carleo, Sparse autoregressive neural networks for classical spin systems, Machine Learning: Science and Technology 5 (2024) 025074. doi:10.1088/2632-2153/ad5783

[7] P. Białas, P. Korcyl, T. Stebel, Hierarchical autoregressive neural networks for statistical systems, Comput. Phys. Commun. 281 (2022) 108502. doi:10.1016/j.cpc.2022.108502. arXiv:2203.10989

[8] A. Singha, E. Cellini, K. A. Nicoli, K. Jansen, S. Kühn, S. Nakajima, Multilevel generative samplers for investigating critical phenomena, 2025. arXiv:2503.08918

[9] L. M. Del Bono, F. Ricci-Tersenghi, F. Zamponi, Nearest-neighbors neural network architecture for efficient sampling of statistical physics models, Machine Learning: Science and Technology 6 (2025) 025029. doi:10.1088/2632-2153/adcdc1. arXiv:2407.19483

[10] S. F. Edwards, P. W. Anderson, Theory of spin glasses, Journal of Physics F: Metal Physics 5 (1975) 965. doi:10.1088/0305-4608/5/5/017

[11] D. Blessing, X. Jia, J. Esslinger, F. Vargas, G. Neumann, Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling, arXiv e-prints (2024). arXiv:2406.07423

[12] M. Germain, K. Gregor, I. Murray, H. Larochelle, MADE: Masked autoencoder for distribution estimation, 2015. arXiv:1502.03509

[13] K. A. Nicoli, S. Nakajima, N. Strodthoff, W. Samek, K.-R. Müller, P. Kessel, Asymptotically unbiased estimation of physical observables with neural samplers, Phys. Rev. E 101 (2020) 023304.

[14] P. Białas, P. Korcyl, T. Stebel, D. Zapolski, Rényi entanglement entropy of a spin chain with generative neural networks, Phys. Rev. E 110 (2024) 044116. doi:10.1103/PhysRevE.110.044116

[15] P. Białas, P. Korcyl, T. Stebel, Mutual information of spin systems from autoregressive neural networks, Phys. Rev. E 108 (2023) 044140. doi:10.1103/PhysRevE.108.044140

[16] A. Bulgarelli, E. Cellini, K. Jansen, S. Kühn, A. Nada, S. Nakajima, K. A. Nicoli, M. Panero, Flow-based sampling for entanglement entropy and the machine learning of defects, Phys. Rev. Lett. 134 (2025) 151601. doi:10.1103/PhysRevLett.134.151601

[17] P. Białas, P. Korcyl, T. Stebel, D. Zapolski, Estimation of the reduced density matrix and entanglement entropies using autoregressive networks, 2025. arXiv:2506.04170

[18] S.-H. Li, L. Wang, Neural network renormalization group, Phys. Rev. Lett. 121 (2018) 260601. doi:10.1103/PhysRevLett.121.260601

[19] Simulating first-order phase transition with hierarchical autoregressive networks, Phys. Rev. E 107 (2023) 054127. doi:10.1103/PhysRevE.107.054127. arXiv:2212.04955

[20] A. E. Ferdinand, M. E. Fisher, Bounded and inhomogeneous Ising models. I. Specific-heat anomaly of a finite lattice, Phys. Rev. 185 (1969) 832–846. doi:10.1103/PhysRev.185.832

[21] A. Kong, A note on importance sampling using standardized weights, University of Chicago Technical Reports (1992)

[22] J. S. Liu, Metropolized independent sampling with comparisons to rejection sampling and importance sampling, Statistics and Computing 6 (1996) 113–119. doi:10.1007/BF00162521

[23] U. Wolff, Collective Monte Carlo updating for spin systems, Phys. Rev. Lett. 62 (1989) 361–364. doi:10.1103/PhysRevLett.62.361

[24] K. A. Nicoli, C. J. Anders, T. Hartung, K. Jansen, P. Kessel, S. Nakajima, Detecting and mitigating mode-collapse for flow-based sampling of lattice field theories, Phys. Rev. D 108 (2023) 114501. doi:10.1103/PhysRevD.108.114501

[25] G. Parisi, Order parameter for spin-glasses, Physical Review Letters 50 (1983) 1946. doi:10.1103/PhysRevLett.50.1946

[26] I. Morgenstern, K. Binder, Magnetic correlations in two-dimensional spin-glasses, Phys. Rev. B 22 (1980) 288–303. doi:10.1103/PhysRevB.22.288

[27] K. Hukushima, K. Nemoto, Exchange Monte Carlo method and application to spin glass simulations, Journal of the Physical Society of Japan 65 (1996) 1604–1608. doi:10.1143/JPSJ.65.1604

[28] P. Białas, P. Korcyl, T. Stebel, A. Stefański, D. Zapolski, Sampling two-dimensional spin systems with transformers, 2026. arXiv:2604.27738

Appendix A. Implementation: As our aim was only to provide a proof of concept for the probability priors, we used a simple architecture, consisting of two dense layers with a LeakyReLU activation function in between....
