arxiv: 2605.27006 · v1 · pith:A3G7GZEAnew · submitted 2026-05-26 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Sampling Data with Chains of Forward-Backward Diffusion Steps

Pith reviewed 2026-06-29 19:15 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords U-turn chains · diffusion models · ergodicity breaking · manifold fragmentation · Metropolis-Hastings correction · feature relaxation · synthetic languages · sampling methods

The pith

U-turn chains from short forward-backward diffusion steps undergo an ergodicity-breaking phase transition on fragmented manifolds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes U-turn chains as Markov chains created by repeating short forward and backward steps from a diffusion model, using Metropolis-Hastings to sample from energy-modified distributions. For synthetic languages, the minimal version of this dynamics experiences an ergodicity-breaking phase transition caused by the fragmentation of the data manifold. Ergodicity returns when the U-turn magnitude is increased. In the broken ergodicity regime, low-level features relax faster than high-level features, and this pattern reverses only when the U-turn is large enough. Experiments on natural language and images reveal slow relaxation with minimal U-turns, particularly for high-level features.

Core claim

U-turn chains are Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images.

What carries the argument

U-turn chains: Markov chains built from short forward-backward diffusion steps, whose proposals stay on the learned data manifold and which, paired with a Metropolis-Hastings correction, sample from energy-modified targets.
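
A minimal sketch of such a chain on a toy state space, under assumptions that are not taken from the paper: states are 4-bit strings, the forward step flips each bit independently with probability `flip_prob`, and the backward step samples the exact Bayes posterior (a stand-in for a perfectly trained denoiser). In that idealized case the composite forward-backward proposal satisfies detailed balance with respect to the data distribution p, so the Metropolis-Hastings acceptance for a target π ∝ p·exp(−E) reduces to min(1, exp(−ΔE)). All names (`uturn_kernel`, `mh_uturn_matrix`, `flip_prob`, `energy`) are illustrative.

```python
import itertools
import numpy as np

def uturn_kernel(p, states, flip_prob):
    """Exact U-turn proposal K[x, x2] = sum_y q(y|x) p(x2|y) on bit strings."""
    d = states.shape[1]
    hamming = (states[:, None, :] != states[None, :, :]).sum(axis=-1)
    # Forward (noising) kernel q[x, y]: independent bit flips. Toy assumption.
    q = flip_prob**hamming * (1.0 - flip_prob) ** (d - hamming)
    # Backward step: exact posterior p(x2|y) ∝ p(x2) q(y|x2), an idealized denoiser.
    joint = p[:, None] * q                      # joint[x2, y]
    post = joint / joint.sum(axis=0, keepdims=True)
    return q @ post.T                           # K[x, x2]

def mh_uturn_matrix(K, energy):
    """Transition matrix of the MH-corrected chain targeting p * exp(-energy)."""
    # K is reversible w.r.t. p, so p cancels in the MH ratio and only the
    # energy difference remains in the acceptance probability.
    accept = np.minimum(1.0, np.exp(-(energy[None, :] - energy[:, None])))
    T = K * accept
    np.fill_diagonal(T, 0.0)
    np.fill_diagonal(T, 1.0 - T.sum(axis=1))    # rejected mass stays put
    return T

rng = np.random.default_rng(0)
states = np.array(list(itertools.product([0, 1], repeat=4)))
p = rng.random(len(states))
p[::3] = 0.0                                    # holes: states off the toy "manifold"
p /= p.sum()
energy = rng.normal(size=len(states))           # hypothetical energy modification

K = uturn_kernel(p, states, flip_prob=0.1)
T = mh_uturn_matrix(K, energy)
pi = p * np.exp(-energy)
pi /= pi.sum()

flux = pi[:, None] * T                          # probability flux x -> x2 under pi
print("detailed balance holds:", np.allclose(flux, flux.T))
print("pi is stationary:", np.allclose(pi @ T, pi))
```

Working with the full transition matrix instead of simulating the chain keeps the check deterministic. With a learned, imperfect denoiser the cancellation used in `mh_uturn_matrix` is only approximate, which is the concern raised in the referee report.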

If this is right

  • Minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold.
  • Ergodicity is restored at larger U-turn magnitude.
  • In the non-ergodic regime, low-level features relax faster than high-level ones.
  • This ordering inverts only at sufficiently large U-turn magnitude.
  • Minimal U-turns relax slowly on natural language and images, especially for high-level features.
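
The relaxation orderings listed above are read off from correlation curves between features of the initial sample and of the n-th chain state; the figure captions describe a per-layer cosine correlation Cℓ(n) and an area-under-curve summary. A rough sketch of that bookkeeping follows, with synthetic AR(1) vectors standing in for real layer activations. The function names and persistence values are made up for illustration, and the plain (uncentered) cosine is an assumption.

```python
import numpy as np

def cosine_correlation(features):
    """features: (n_steps + 1, dim) activations of one layer along a chain.
    Returns C(n), the cosine between the step-0 and step-n feature vectors."""
    f0 = features[0]
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(f0)
    return (features @ f0) / norms

def survival_auc(curve):
    """Trapezoidal area under C(n) with unit spacing; larger = slower relaxation."""
    return float(np.sum(0.5 * (curve[1:] + curve[:-1])))

def toy_layer(persistence, n_steps, dim, rng):
    """Synthetic stand-in for a layer: AR(1) features with given persistence."""
    feats = [rng.normal(size=dim)]
    for _ in range(n_steps):
        noise = rng.normal(size=dim)
        feats.append(persistence * feats[-1] + np.sqrt(1 - persistence**2) * noise)
    return np.stack(feats)

rng = np.random.default_rng(0)
fast = cosine_correlation(toy_layer(0.5, n_steps=30, dim=2000, rng=rng))
slow = cosine_correlation(toy_layer(0.95, n_steps=30, dim=2000, rng=rng))
auc_fast, auc_slow = survival_auc(fast), survival_auc(slow)
print(f"AUC fast layer: {auc_fast:.2f}, AUC slow layer: {auc_slow:.2f}")
```

Comparing `survival_auc` across layers gives a single number per layer, so an ordering inversion would show up as a sign change in the difference between deep-layer and shallow-layer AUC.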

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results imply that diffusion-based sampling is sensitive to the scale of steps relative to manifold connectivity.
  • High-level features in deep models may require larger perturbations to mix efficiently due to manifold structure.
  • The phase transition suggests a general mechanism for understanding slow mixing in generative model sampling.
  • Extensions could involve tuning U-turn magnitude based on feature hierarchy for better sampling.

Load-bearing premise

The diffusion model accurately captures the data manifold such that short forward-backward steps remain on it and the Metropolis-Hastings correction introduces no additional bias.

What would settle it

An experiment on synthetic languages showing no phase transition in ergodicity or no inversion in relaxation ordering as U-turn magnitude increases would falsify the main claims.
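
One way such a test could be scored, sketched under assumptions: the Figure 3 caption describes a late-time plateau normalized by the standard deviation of the overlap between independent random pairs. The snippet below assumes that the overlap is the fraction of matching tokens and that the independent-pair mean is subtracted before normalizing; neither detail is confirmed by the text on this page. Frozen and fully resampled toy chains serve as the two extreme cases, and all names are hypothetical.

```python
import numpy as np

def overlap(a, b):
    """Fraction of positions where two token arrays agree (last axis = position)."""
    return (a == b).mean(axis=-1)

def normalized_plateau(traj, reference, late_fraction=0.25):
    """traj: (n_chains, n_steps + 1, length) token trajectories.
    reference: (n_samples, length) independent samples from the same source.
    Returns (late-time overlap with start - independent-pair mean) / pair std."""
    n_late = max(1, int(late_fraction * traj.shape[1]))
    late = overlap(traj[:, -n_late:, :], traj[:, :1, :]).mean()
    half = len(reference) // 2
    pairs = overlap(reference[:half], reference[half:2 * half])
    return (late - pairs.mean()) / pairs.std()

rng = np.random.default_rng(0)
n_chains, n_steps, length, vocab = 200, 40, 32, 4
reference = rng.integers(vocab, size=(2000, length))

# Extreme case 1: a chain that never moves (fully non-ergodic).
start = rng.integers(vocab, size=(n_chains, 1, length))
frozen = np.repeat(start, n_steps + 1, axis=1)
# Extreme case 2: every step is a fresh independent sample (ideal mixing).
mixed = rng.integers(vocab, size=(n_chains, n_steps + 1, length))

print("frozen chain plateau:", normalized_plateau(frozen, reference))
print("mixed chain plateau:", normalized_plateau(mixed, reference))
```

A plateau that stays many standard deviations above zero as the trajectory length grows would indicate broken ergodicity; a value statistically indistinguishable from zero would indicate that the chain has forgotten its starting point.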

Figures

Figures reproduced from arXiv: 2605.27006 by Corinna Elena Wegner, Daniel J. Korchinski, Hyunmo Kang, Matthieu Wyart, Noam Itzhak Levi.

Figure 1: Left: A single U-turn move first corrupts a sample by adding noise or masking part of the input, then reconstructs it using a trained diffusion model. The U-turn magnitude controls the size of the perturbation. Middle: Iterating U-turn moves defines a Markov chain on the learned data distribution, allowing us to study whether the chain is ergodic. Right: Schematic of the Random Hierarchy Model, where visib…
Figure 2: Dynamics of minimal UTMC for the RHM with …
Figure 3: Left: Dynamics of the RHM with L = 4, s = 2, f = 0.125 < fper. For these parameters, minimal UTMC is non-ergodic, but UTMC with larger U-turn steps progressively reduces the long-time plateau. Right: Phase diagram for s = 2, L = 8, showing the late-time plateau normalized by the standard deviation of the overlap between independent random pairs. Light pink regions are statistically indistinguishable from th…
Figure 4: Layer-wise relaxation across dynamical regimes.
Figure 5: Layer-wise latent correlation and ordering inversion in text. Left: For minimal U-turn steps, all layers relax slowly, with deeper layers in the early-to-intermediate range retaining memory of the initial text for longer. Middle: Increasing the masking fraction accelerates decorrelation across layers. Right: At large masking fraction, the layer ordering inverts: deeper representations decorrelate faster th…
Figure 6: Layer-wise latent correlation and ordering inversion in images. Cosine correlation Cℓ(n) of ConvNeXt feature activations between the initial image and sequential U-turn states, averaged over the 20-image ImageNet validation set. Colors denote ConvNeXt feature depth, from early layers (red) to deep layers (purple); the classifier head is excluded. Left: At small U-turn magnitude, ρ ≃ 0.1, deeper visual r…
Figure 7: Representative sequential U-turn trajectories. Rows use the same initial ImageNet validation example and trajectory index from the latent-analysis dataset, with fixed noise fractions ρ = 0.1, 0.4, 0.8. Columns show the sequential U-turn index n. The qualitative drift accelerates as the per-step noise increases, matching the quantitative collapse of latent correlation.
Figure 8: Plateau heatmaps for the RHM with s = 2 and L = 8, computed using two trajectory lengths. Left: ρd · nmax = 10^4. Right: ρd · nmax = 10^5. The qualitative structure of the phase diagram is stable across these two choices, indicating that the observed non-ergodic and effectively ergodic regions are not artifacts of a single finite U-turn step cutoff.
Figure 9: Decay of Mistral representation correlations along sequential language U-turn chains for …
Figure 10: Centered cosine similarity of Mistral representations after a single language U-turn step, …
Figure 11: Perplexity of text produced by language U-turns, measured using Mistral 7B.
Figure 12: Centered cosine similarity of Mistral representations after a single language U-turn step …
Figure 13: Full layer-wise latent persistence sweep. Cosine correlation between ConvNeXt feature activations at U-turn step n and at initialization, averaged over the 20-image ImageNet validation set with the classifier head excluded. Panels show different noise fractions ρ = t/T; insets zoom into the low-correlation regime for large ρ. The sweep shows the gradual collapse of long-lived high-level memory as the U-tu…
Figure 14: AUC summary of image latent persistence. Left: Mean area under the cosine-survival curve for early and late ConvNeXt feature groups as a function of noise fraction. Shaded bands denote SEM across the 20 images. Right: Difference between late-feature and early-feature AUC. The dotted line marks the interpolated ordering transition for this integrated diagnostic.
Figure 15: Gives a complementary single-step diagnostic by extracting Cℓ(1) from the first step of the same sequential U-turn trajectories used above. This view makes clear that increasing ρ first perturbs early layers while leaving deeper representations relatively stable, and then eventually affects all recorded feature layers. The dashed line marks the zero crossing of the late-minus-early single-step gap.
Figure 16: Robustness of the image layer-ordering diagnostic. Ordering summaries are shown both when the classifier head is treated as the deepest output and when it is excluded. The sign and scale of the AUC and half-life gaps show that the observed transition is not an artifact of a single averaging convention or of the classifier head alone.
Original abstract

Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient -- signatures consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces U-turn chains: Markov chains formed by iterating short forward-backward steps from a diffusion model, each paired with a Metropolis-Hastings correction to sample from energy-modified targets while remaining on the learned data manifold. For synthetic languages, it claims that minimal U-turn dynamics exhibits an ergodicity-breaking phase transition driven by fragmentation of the data manifold, with ergodicity restored at larger U-turn magnitudes. In the non-ergodic regime, low-level features are reported to relax faster than high-level ones, with this ordering inverting only at sufficiently large U-turn magnitude. The predictions are tested on natural language and natural images, where minimal U-turns show slow relaxation (especially for high-level features from deep CNN or LLM representations), with layer-ordering inversion appearing only at large noise levels where mixing is efficient.

Significance. If the central claims hold after sampling correctness is addressed, the work provides a controlled demonstration of ergodicity phase transitions and hierarchical relaxation ordering on learned manifolds. The synthetic-language setting is a strength because it isolates the effects of manifold fragmentation. The results could inform sampling strategies for diffusion models and highlight the limitations of local dynamics in high-dimensional generative modeling.

major comments (1)
  1. [U-turn chain construction and Metropolis-Hastings correction (as described in the abstract and methods)] The phase transition, feature relaxation ordering, and all empirical results on natural data rest on the assumption that MH-corrected U-turn proposals exactly target the intended energy-modified distribution. Manifold approximation errors (unavoidable for learned diffusion models on discrete or high-dimensional data) can render proposals non-reversible or incorrectly normalized, so that the acceptance ratio fails to cancel the correct density ratio and the chain samples a distorted distribution. This makes the reported ergodicity breaking and relaxation inversion potentially artifactual rather than intrinsic to the true manifold. A theoretical analysis or direct empirical validation (e.g., via exact low-dimensional cases or bias diagnostics) of sampling correctness is required.
minor comments (1)
  1. [Abstract] The abstract states that predictions are tested on natural language and images but provides no detail on the specific relaxation metrics, controls for post-hoc parameter choices, or error bars; these should be added for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The concern regarding the exactness of the Metropolis-Hastings correction under manifold approximation is well-taken, and we address it directly below.

Point-by-point responses
  1. Referee: [U-turn chain construction and Metropolis-Hastings correction (as described in the abstract and methods)] The phase transition, feature relaxation ordering, and all empirical results on natural data rest on the assumption that MH-corrected U-turn proposals exactly target the intended energy-modified distribution. Manifold approximation errors (unavoidable for learned diffusion models on discrete or high-dimensional data) can render proposals non-reversible or incorrectly normalized, so that the acceptance ratio fails to cancel the correct density ratio and the chain samples a distorted distribution. This makes the reported ergodicity breaking and relaxation inversion potentially artifactual rather than intrinsic to the true manifold. A theoretical analysis or direct empirical validation (e.g., via exact low-dimensional cases or bias diagnostics) of sampling correctness is required.

    Authors: We agree that a rigorous check of sampling correctness is essential. In the synthetic-language setting the manifold is generated from an exact, known process and the diffusion model is trained to near-perfect fidelity, so the proposal distribution is reversible with respect to the data measure and the MH ratio is exact; the reported phase transition is therefore intrinsic. For the natural-data experiments we will add (i) a short theoretical paragraph stating the exact-manifold assumption under which the MH correction is valid and (ii) empirical diagnostics (acceptance-rate stability and marginal-moment matching on low-dimensional projections) that quantify residual bias. These additions will be included in the revised manuscript.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The provided abstract and summary contain no equations, derivations, or self-citations. Claims about ergodicity-breaking transitions and feature relaxation orderings are presented as empirical simulation results on synthetic languages, with tests on natural data. No steps reduce by construction to fitted parameters, self-definitions, or author-prior ansatzes; the derivation chain is not visible and thus cannot be shown to collapse to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents full audit. The central claim rests on the unverified premise that the learned diffusion model defines a manifold on which short forward-backward steps remain valid proposals.


