Sampling Data with Chains of Forward-Backward Diffusion Steps

Corinna Elena Wegner; Daniel J. Korchinski; Hyunmo Kang; Matthieu Wyart; Noam Itzhak Levi

arXiv: 2605.27006 · v1 · pith:A3G7GZEAnew · submitted 2026-05-26 · cs.LG · cond-mat.dis-nn · stat.ML

Pith reviewed 2026-06-29 19:15 UTC · model grok-4.3

keywords: U-turn chains · diffusion models · ergodicity breaking · manifold fragmentation · Metropolis-Hastings correction · feature relaxation · synthetic languages · sampling methods

The pith

U-turn chains from short forward-backward diffusion steps undergo an ergodicity-breaking phase transition on fragmented manifolds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes U-turn chains as Markov chains created by repeating short forward and backward steps from a diffusion model, using Metropolis-Hastings to sample from energy-modified distributions. For synthetic languages, the minimal version of this dynamics experiences an ergodicity-breaking phase transition caused by the fragmentation of the data manifold. Ergodicity returns when the U-turn magnitude is increased. In the broken ergodicity regime, low-level features relax faster than high-level features, and this pattern reverses only when the U-turn is large enough. Experiments on natural language and images reveal slow relaxation with minimal U-turns, particularly for high-level features.

Core claim

U-turn chains are Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. The authors test these predictions on natural language and natural images.
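The mechanism can be illustrated with a minimal sketch. It is a toy stand-in, not the authors' implementation: the four-string `SUPPORT`, the exact "denoiser" that resamples masked positions uniformly from consistent strings, the `energy` function, and the parameter names `n_mask` and `beta` are all assumptions made for illustration. Because this toy proposal is symmetric on a uniform support, the Metropolis-Hastings ratio reduces to an energy difference; that simplification is an assumption about the toy, not a statement about the paper's acceptance rule.

```python
import math
import random

# Toy "data manifold" (hypothetical): four allowed bit strings, uniform weight.
# Any two of them differ in at least two positions, so the support is fragmented
# with respect to single-position moves.
SUPPORT = ["0000", "0011", "1100", "1111"]


def u_turn(x, n_mask, rng):
    """One forward-backward step: mask n_mask positions, then denoise exactly."""
    masked = set(rng.sample(range(len(x)), n_mask))
    # Exact toy denoiser: uniform over support strings that agree with x on
    # every unmasked position (x itself always qualifies).
    candidates = [
        s for s in SUPPORT
        if all(s[i] == x[i] for i in range(len(x)) if i not in masked)
    ]
    return rng.choice(candidates)


def energy(x, beta):
    """Hypothetical energy: beta times the number of ones in the string."""
    return beta * x.count("1")


def run_chain(x0, n_mask, n_steps, beta=0.0, seed=0):
    """Iterate U-turn proposals with a Metropolis-Hastings accept/reject step."""
    rng = random.Random(seed)
    x, visited = x0, [x0]
    for _ in range(n_steps):
        proposal = u_turn(x, n_mask, rng)
        # Symmetric proposal on a uniform support: accept with
        # probability min(1, exp(-(E(proposal) - E(x)))).
        log_accept = energy(x, beta) - energy(proposal, beta)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            x = proposal
        visited.append(x)
    return visited
```

Calling `run_chain("0000", n_mask=1, n_steps=200)` never leaves the initial string, because no other support string agrees with it on three positions, while `n_mask=4` lets the chain reach the other strings. This is only a cartoon of the fragmentation-driven loss of ergodicity described above.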

What carries the argument

U-turn chains: Markov chains built from short forward-backward diffusion steps whose proposals stay on the learned data manifold and, with a Metropolis-Hastings correction, sample energy-modified targets.

If this is right

Minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold.
Ergodicity is restored at larger U-turn magnitude.
In the non-ergodic regime, low-level features relax faster than high-level ones.
This ordering inverts only at sufficiently large U-turn magnitude.
Minimal U-turns relax slowly on natural language and images, especially for high-level features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The results imply that diffusion-based sampling is sensitive to the scale of steps relative to manifold connectivity.
High-level features in deep models may require larger perturbations to mix efficiently due to manifold structure.
The phase transition suggests a general mechanism for understanding slow mixing in generative model sampling.
Extensions could involve tuning U-turn magnitude based on feature hierarchy for better sampling.

Load-bearing premise

The diffusion model accurately captures the data manifold such that short forward-backward steps remain on it and the Metropolis-Hastings correction introduces no additional bias.
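A short worked identity shows what this premise buys. The notation is assumed here and does not come from the abstract: $p$ is the model distribution, the energy-modified target is taken to be $\pi(x) \propto p(x)\,e^{-E(x)}$, and the U-turn proposal kernel $q$ is assumed to satisfy detailed balance with respect to $p$.

$$
p(x)\,q(x \to x') = p(x')\,q(x' \to x)
\;\Longrightarrow\;
A(x \to x') = \min\!\left(1,\ \frac{\pi(x')\,q(x' \to x)}{\pi(x)\,q(x \to x')}\right)
= \min\!\left(1,\ e^{-\left[E(x') - E(x)\right]}\right).
$$

Under these assumptions the acceptance probability depends only on the energy difference. If the learned denoiser is only approximately reversible with respect to $p$, the factor $p(x')\,q(x' \to x) / \left[p(x)\,q(x \to x')\right]$ no longer equals one, and a rule that uses the energy difference alone would target a distorted distribution.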

What would settle it

An experiment on synthetic languages showing no phase transition in ergodicity or no inversion in relaxation ordering as U-turn magnitude increases would falsify the main claims.

Figures

Figures reproduced from arXiv: 2605.27006 by Corinna Elena Wegner, Daniel J. Korchinski, Hyunmo Kang, Matthieu Wyart, Noam Itzhak Levi.

**Figure 1.** Left: A single U-turn move first corrupts a sample by adding noise or masking part of the input, then reconstructs it using a trained diffusion model. The U-turn magnitude controls the size of the perturbation. Middle: Iterating U-turn moves defines a Markov chain on the learned data distribution, allowing us to study whether the chain is ergodic. Right: Schematic of the Random Hierarchy Model, where visib…

**Figure 2.** Dynamics of minimal UTMC for the RHM with …

**Figure 3.** Left: Dynamics of the RHM with L = 4, s = 2, f = 0.125 < fper. For these parameters, minimal UTMC is non-ergodic, but UTMC with larger U-turn steps progressively reduces the long-time plateau. Right: Phase diagram for s = 2, L = 8, showing the late-time plateau normalized by the standard deviation of the overlap between independent random pairs. Light pink regions are statistically indistinguishable from th…

**Figure 4.** Layer-wise relaxation across dynamical regimes.

**Figure 5.** Layer-wise latent correlation and ordering inversion in text. Left: For minimal U-turn steps, all layers relax slowly, with deeper layers in the early-to-intermediate range retaining memory of the initial text for longer. Middle: Increasing the masking fraction accelerates decorrelation across layers. Right: At large masking fraction, the layer ordering inverts: deeper representations decorrelate faster th…

**Figure 6.** Layer-wise latent correlation and ordering inversion in images. Cosine correlation Cℓ(n) of ConvNeXt feature activations between the initial image and sequential U-turn states, averaged over the 20 ImageNet validation images. Colors denote ConvNeXt feature depth, from early layers (red) to deep layers (purple); the classifier head is excluded. Left: At small U-turn magnitude, ρ ≃ 0.1, deeper visual r…

**Figure 7.** Representative sequential U-turn trajectories. Rows use the same initial ImageNet validation example and trajectory index from the latent-analysis dataset, with fixed noise fractions ρ = 0.1, 0.4, 0.8. Columns show the sequential U-turn index n. The qualitative drift accelerates as the per-step noise increases, matching the quantitative collapse of latent correlation.

**Figure 8.** Plateau heatmaps for the RHM with s = 2 and L = 8, computed using two trajectory lengths. Left: ρd · nmax = 10^4. Right: ρd · nmax = 10^5. The qualitative structure of the phase diagram is stable across these two choices, indicating that the observed non-ergodic and effectively ergodic regions are not artifacts of a single finite U-turn step cutoff.

**Figure 9.** Decay of Mistral representation correlations along sequential language U-turn chains for …

**Figure 10.** Centered cosine similarity of Mistral representations after a single language U-turn step, …

**Figure 11.** Perplexity of text produced by language U-turns, measured using Mistral 7B.

**Figure 12.** Centered cosine similarity of Mistral representations after a single language U-turn step …

**Figure 13.** Full layer-wise latent persistence sweep. Cosine correlation between ConvNeXt feature activations at U-turn step n and at initialization, averaged over the 20-image ImageNet validation set with the classifier head excluded. Panels show different noise fractions ρ = t/T; insets zoom into the low-correlation regime for large ρ. The sweep shows the gradual collapse of long-lived high-level memory as the U-tu…

**Figure 14.** AUC summary of image latent persistence. Left: Mean area under the cosine-survival curve for early and late ConvNeXt feature groups as a function of noise fraction. Shaded bands denote SEM across the 20 images. Right: Difference between late-feature and early-feature AUC. The dotted line marks the interpolated ordering transition for this integrated diagnostic.

**Figure 15.** A complementary single-step diagnostic, obtained by extracting Cℓ(1) from the first step of the same sequential U-turn trajectories. This view makes clear that increasing ρ first perturbs early layers while leaving deeper representations relatively stable, and then eventually affects all recorded feature layers. The dashed line marks the zero crossing of the late-minus-early single-step gap.

**Figure 16.** Robustness of the image layer-ordering diagnostic. Ordering summaries are shown both when the classifier head is treated as the deepest output and when it is excluded. The sign and scale of the AUC and half-life gaps show that the observed transition is not an artifact of a single averaging convention or of the classifier head alone.
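Several captions rely on a layer-wise cosine correlation Cℓ(n) between features at step n and at initialization, and on the area under that curve. The sketch below shows one plausible way to compute both for a single layer. It is an illustration under assumptions: the function names, the plain (uncentered) cosine, unit step spacing, and trapezoidal integration are choices made here, and the paper's exact definitions may differ.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def layer_correlation(trajectory):
    """C_l(n) for one layer: cosine between features at step 0 and at step n.

    `trajectory` is a list of feature vectors, one per U-turn step.
    """
    first = trajectory[0]
    return [cosine(first, feats) for feats in trajectory]


def auc(curve):
    """Trapezoidal area under a curve sampled at unit step spacing."""
    return sum(0.5 * (curve[i] + curve[i + 1]) for i in range(len(curve) - 1))


# Hypothetical two-dimensional features for three consecutive steps.
toy_trajectory = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_curve = layer_correlation(toy_trajectory)
toy_area = auc(toy_curve)
```

Comparing `auc` values between early and late layers at a given noise fraction would give a late-minus-early gap in the spirit of the ordering diagnostic described in the captions, assuming features are extracted with the same model at every step.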

Original abstract

Sampling from learned high-dimensional distributions is a foundational computational problem. We introduce U-turn chains: Markov chains obtained by iterating short forward-backward steps of a diffusion model, in which each step proposes a move that remains on the learned data manifold and, paired with a Metropolis-Hastings correction, samples from energy-modified targets. For synthetic languages, we show that minimal U-turn dynamics undergoes an ergodicity-breaking phase transition driven by fragmentation of the data manifold; ergodicity is restored at larger U-turn magnitude. In the non-ergodic regime, low-level features relax faster than high-level ones, an ordering that inverts only at sufficiently large U-turn magnitude. We test these predictions on natural language and natural images. In both modalities, minimal U-turns relax slowly, especially for high-level features approximated by deep representations in CNNs or LLMs. The layer-ordering inversion appears only at large noise when mixing is efficient -- signatures consistent with strongly constrained, weakly mixing local dynamics. We discuss the implications of these results for sampling with diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

U-turn chains are a straightforward diffusion-MCMC hybrid that reports an ergodicity transition on synthetic manifolds, but the MH correction's reliability under approximation errors is the main open question.

The main takeaway is that this paper defines U-turn chains as iterated short forward-backward diffusion steps plus Metropolis-Hastings correction to target energy-modified distributions, then shows that minimal versions undergo an ergodicity-breaking transition on fragmented synthetic manifolds, with low-level features relaxing faster than high-level ones until larger steps restore mixing. They check the pattern on natural language and images and find slow high-level mixing in the minimal regime.

The construction itself is new enough relative to standard diffusion sampling and the empirical consistency across synthetic and real modalities is useful. It gives a concrete way to think about why diffusion chains can get stuck on semantic features.

The soft spot is the sampling validity. The claims rest on the MH step exactly correcting to the intended target, yet any manifold approximation error from the learned diffusion model can break reversibility or normalization in the acceptance ratio. That could make the reported transition and ordering inversion an artifact of biased dynamics rather than a property of the true manifold. The abstract gives no equations or bias checks, so it is impossible to tell how they handled this. The stress-test note flags exactly this issue, and it lands because the central results depend on unbiased sampling.

This is for people working on diffusion sampling and MCMC hybrids. A reader who cares about mixing times in generative models would get something from the empirical observations. It deserves a serious referee because the U-turn idea and the reported phase behavior are concrete enough to review in detail, even if the theory section needs more work on error control.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces U-turn chains: Markov chains formed by iterating short forward-backward steps from a diffusion model, each paired with a Metropolis-Hastings correction to sample from energy-modified targets while remaining on the learned data manifold. For synthetic languages, it claims that minimal U-turn dynamics exhibits an ergodicity-breaking phase transition driven by fragmentation of the data manifold, with ergodicity restored at larger U-turn magnitudes. In the non-ergodic regime, low-level features are reported to relax faster than high-level ones, with this ordering inverting only at sufficiently large U-turn magnitude. The predictions are tested on natural language and natural images, where minimal U-turns show slow relaxation (especially for high-level features from deep CNN or LLM representations), with layer-ordering inversion appearing only at large noise levels where mixing is efficient.

Significance. If the central claims hold after addressing sampling correctness, the work provides a controlled demonstration of ergodicity phase transitions and hierarchical relaxation ordering on learned manifolds, with synthetic languages serving as a controlled setting that isolates manifold fragmentation effects. This could inform sampling strategies with diffusion models and highlight limitations of local dynamics in high-dimensional generative modeling.

major comments (1)

[U-turn chain construction and Metropolis-Hastings correction (as described in the abstract and methods)] The phase transition, feature relaxation ordering, and all empirical results on natural data rest on the assumption that MH-corrected U-turn proposals exactly target the intended energy-modified distribution. Manifold approximation errors (unavoidable for learned diffusion models on discrete or high-dimensional data) can render proposals non-reversible or incorrectly normalized, so that the acceptance ratio fails to cancel the correct density ratio and the chain samples a distorted distribution. This makes the reported ergodicity breaking and relaxation inversion potentially artifactual rather than intrinsic to the true manifold. A theoretical analysis or direct empirical validation (e.g., via exact low-dimensional cases or bias diagnostics) of sampling correctness is required.

minor comments (1)

[Abstract] The abstract states that predictions are tested on natural language and images but provides no detail on the specific relaxation metrics, controls for post-hoc parameter choices, or error bars; these should be added for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The concern regarding the exactness of the Metropolis-Hastings correction under manifold approximation is well-taken, and we address it directly below.

Referee: [U-turn chain construction and Metropolis-Hastings correction (as described in the abstract and methods)] The phase transition, feature relaxation ordering, and all empirical results on natural data rest on the assumption that MH-corrected U-turn proposals exactly target the intended energy-modified distribution. Manifold approximation errors (unavoidable for learned diffusion models on discrete or high-dimensional data) can render proposals non-reversible or incorrectly normalized, so that the acceptance ratio fails to cancel the correct density ratio and the chain samples a distorted distribution. This makes the reported ergodicity breaking and relaxation inversion potentially artifactual rather than intrinsic to the true manifold. A theoretical analysis or direct empirical validation (e.g., via exact low-dimensional cases or bias diagnostics) of sampling correctness is required.

Authors: We agree that a rigorous check of sampling correctness is essential. In the synthetic-language setting the manifold is generated from an exact, known process and the diffusion model is trained to near-perfect fidelity, so the proposal distribution is reversible with respect to the data measure and the MH ratio is exact; the reported phase transition is therefore intrinsic. For the natural-data experiments we will add (i) a short theoretical paragraph stating the exact-manifold assumption under which the MH correction is valid and (ii) empirical diagnostics (acceptance-rate stability and marginal-moment matching on low-dimensional projections) that quantify residual bias. These additions will be included in the revised manuscript.

Revision: yes
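The kind of exact low-dimensional check requested by the referee can be sketched on a toy problem where the target is known in closed form. Everything here is hypothetical and chosen for illustration: the four-string `SUPPORT`, the energy scale `BETA`, the exact masked-resampling proposal, and the `diagnose` helper are not the authors' diagnostics. The sketch reports the acceptance rate together with the total-variation distance between the chain's empirical histogram and the exact target.

```python
import math
import random
from collections import Counter

SUPPORT = ["0000", "0011", "1100", "1111"]  # hypothetical toy support
BETA = 0.5  # hypothetical energy scale


def energy(x):
    """Hypothetical energy: BETA times the number of ones."""
    return BETA * x.count("1")


def exact_target():
    """Exact target on the toy support: uniform base measure times exp(-E)."""
    weights = {s: math.exp(-energy(s)) for s in SUPPORT}
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}


def propose(x, n_mask, rng):
    """Mask n_mask positions, then resample uniformly among consistent strings."""
    masked = set(rng.sample(range(len(x)), n_mask))
    candidates = [
        s for s in SUPPORT
        if all(s[i] == x[i] for i in range(len(x)) if i not in masked)
    ]
    return rng.choice(candidates)


def diagnose(n_mask, n_steps=20000, seed=0):
    """Return (acceptance rate, total-variation distance to the exact target)."""
    rng = random.Random(seed)
    x, counts, accepted = SUPPORT[0], Counter(), 0
    for _ in range(n_steps):
        y = propose(x, n_mask, rng)
        log_ratio = energy(x) - energy(y)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = y
            accepted += 1  # self-proposals (y == x) also count as accepted
        counts[x] += 1
    target = exact_target()
    tv = 0.5 * sum(abs(counts[s] / n_steps - target[s]) for s in SUPPORT)
    return accepted / n_steps, tv
```

In this toy, `diagnose(n_mask=1)` accepts every proposal yet stays far from the target, because the chain never leaves its initial string, whereas `diagnose(n_mask=4)` should land close to it. That contrast suggests acceptance-rate stability alone would not detect the non-ergodic regime, so a distribution-level comparison is a useful complement.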

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and summary contain no equations, derivations, or self-citations. Claims about ergodicity-breaking transitions and feature relaxation orderings are presented as empirical simulation results on synthetic languages, with tests on natural data. No steps reduce by construction to fitted parameters, self-definitions, or author-prior ansatzes; the derivation chain is not visible and thus cannot be shown to collapse to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents full audit. The central claim rests on the unverified premise that the learned diffusion model defines a manifold on which short forward-backward steps remain valid proposals.

Reference graph

Works this paper leans on

60 extracted references · 24 canonical work pages · 6 internal anchors

[1]

Robert and George Casella.Monte Carlo Statistical Methods

Christian P. Robert and George Casella.Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, New York, 2 edition, 2004

2004
[2]

M. E. J. Newman and G. T. Barkema.Monte Carlo Methods in Statistical Physics. Oxford University Press, Oxford, 1999

1999
[3]

Academic Press, San Diego, 2 edition, 2002

Daan Frenkel and Berend Smit.Understanding Molecular Simulation: From Algorithms to Applications. Academic Press, San Diego, 2 edition, 2002

2002
[4]

Bernardi, Marcelo C.R

Rafael C. Bernardi, Marcelo C.R. Melo, and Klaus Schulten. Enhanced sampling tech- niques in molecular dynamics simulations of biological systems.Biochimica et Biophys- ica Acta (BBA) - General Subjects, 1850(5):872–877, 2015. ISSN 0304-4165. doi: https: //doi.org/10.1016/j.bbagen.2014.10.019. URL https://www.sciencedirect.com/science/article/ pii/S030441...

work page doi:10.1016/j.bbagen.2014.10.019 2015
[5]

Onuchic, Z Luthey-Schulten, and P.G

J.N. Onuchic, Z Luthey-Schulten, and P.G. Wolynes. Theory of protein folding: the energy landscape perspective.Annual review of physical chemistry, 48, 545–600, 1997. doi: https: //doi.org/10.1146/annurev.physchem.48.1.545

work page doi:10.1146/annurev.physchem.48.1.545 1997
[6]

Virasoro.Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, volume 9 ofWorld Scientific Lecture Notes in Physics

Marc Mézard, Giorgio Parisi, and Miguel A. Virasoro.Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications, volume 9 ofWorld Scientific Lecture Notes in Physics. World Scientific, Singapore, 1987

1987
[7]

Levin, Yuval Peres, and Elizabeth L

David A. Levin, Yuval Peres, and Elizabeth L. Wilmer.Markov Chains and Mixing Times. American Mathematical Society, Providence, RI, 2009

2009
[8]

Equation of State Calculations by Fast Computing Machines

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines.The Journal of Chemical Physics, 21(6):1087–1092, 1953. doi: 10.1063/1.1699114

work page doi:10.1063/1.1699114 1953
[9]

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970. doi: 10.1093/biomet/57.1.97

work page doi:10.1093/biomet/57.1.97 1970
[10]

Bayesian learning via stochastic gradient Langevin dynamics

Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. InProceedings of the 28th International Conference on Machine Learning (ICML), pages 681–688, 2011

2011
[11]

Radford M. Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, editors,Handbook of Markov Chain Monte Carlo, pages 113–162. Chapman & Hall/CRC, Boca Raton, FL, 2011

2011
[12]

Swendsen and Jian-Sheng Wang

Robert H. Swendsen and Jian-Sheng Wang. Replica Monte Carlo simulation of spin-glasses. Physical Review Letters, 57(21):2607–2609, 1986. doi: 10.1103/PhysRevLett.57.2607

work page doi:10.1103/physrevlett.57.2607 1986
[13]

Earl and Michael W

David J. Earl and Michael W. Deem. Parallel tempering: Theory, applications, and new perspectives.Physical Chemistry Chemical Physics, 7(23):3910–3916, 2005. doi: 10.1039/ B509983H

2005
[14]

Weiss, Niru Maheswaranathan, and Surya Ganguli

Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational Conference on Machine Learning, 2015

2015
[15]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020. 10

2020
[16]

Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021

2021
[17]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. InInternational Conference on Machine Learning, pages 8162–8171. PMLR, 2021

2021
[18]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in Neural Information Processing Systems, 2021

2021
[19]

U-turn diffusion.Entropy, 27(4), 2025

Hamidreza Behjoo and Michael Chertkov. U-turn diffusion.Entropy, 27(4), 2025. ISSN 1099-4300. doi: 10.3390/e27040343. URL https://www.mdpi.com/1099-4300/27/4/343

work page doi:10.3390/e27040343 2025
[20]

A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1):e2408799121, 2025

Antonio Sclocchi, Alessandro Favero, and Matthieu Wyart. A phase transition in diffusion models reveals the hierarchical nature of data.Proceedings of the National Academy of Sciences, 122(1):e2408799121, 2025

2025
[21]

The MIT Press, 50 edition, 1965

Noam Chomsky.Aspects of the Theory of Syntax. The MIT Press, 50 edition, 1965. ISBN 9780262527408. URL http://www.jstor.org/stable/j.ctt17kk81z

1965
[22]

Formal language theory: refining the chomsky hierarchy

Gerhard Jäger and James Rogers. Formal language theory: refining the chomsky hierarchy. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1598):1956–1970, 2012

1956
[23]

JHU Press, 1996

Ulf Grenander.Elements of pattern theory. JHU Press, 1996

1996
[24]

Deep Learning and Hierarchal Generative Models

Elchanan Mossel. Deep learning and hierarchal generative models, 2018. URL https://arxiv. org/abs/1612.09057

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review

Tomaso Poggio, Hrushikesh Mhaskar, Lorenzo Rosasco, Brando Miranda, and Qianli Liao. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017

2017
[26]

A Provably Correct Algorithm for Deep Learning that Actually Works

Eran Malach and Shai Shalev-Shwartz. A provably correct algorithm for deep learning that actually works, 2018. URL https://arxiv.org/abs/1803.09522

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

Nonparametric regression using deep neural networks with relu activation function.The Annals of Statistics, 48(4):1875–1897, 2020

Johannes Schmidt-Hieber. Nonparametric regression using deep neural networks with relu activation function.The Annals of Statistics, 48(4):1875–1897, 2020

2020
[28]

Malach and S

E. Malach and S. Shalev-Shwartz. The implications of local correlation on learning some deep functions. InAdvances in Neural Information Processing Systems, volume 33, pages 1322–1332, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ 0e4ceef65add6cf21c0f3f9da53b71c0-Paper.pdf

2020
[29]

Tomasini, Alessandro Favero, and Matthieu Wyart

Francesco Cagnetta, Leonardo Petrini, Umberto M. Tomasini, Alessandro Favero, and Matthieu Wyart. How deep neural networks learn compositional data: The random hierarchy model. Phys. Rev. X, 14:031001, Jul 2024. doi: 10.1103/PhysRevX.14.031001. URL https://link.aps. org/doi/10.1103/PhysRevX.14.031001

work page doi:10.1103/physrevx.14.031001 2024
[30]

Towards a theory of how the structure of language is acquired by deep neural networks.Advances in Neural Information Processing Systems, 37: 83119–83163, 2024

Francesco Cagnetta and Matthieu Wyart. Towards a theory of how the structure of language is acquired by deep neural networks.Advances in Neural Information Processing Systems, 37: 83119–83163, 2024

2024
[31]

Deriving neural scaling laws from the statistics of natural language.arXiv preprint arXiv:2602.07488, 2026

Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart. Deriving neural scaling laws from the statistics of natural language.arXiv preprint arXiv:2602.07488, 2026

work page arXiv 2026
[32]

Learning curves theory for hi- erarchically compositional data with power-law distributed features

Francesco Cagnetta, Hyunmo Kang, and Matthieu Wyart. Learning curves theory for hi- erarchically compositional data with power-law distributed features. InProceedings of the 42nd International Conference on Machine Learning, volume 267 ofProceedings of Machine Learning Research, pages 6149–6164. PMLR, 2025. URL https://proceedings.mlr.press/v267/ cagnetta25a.html

2025
[33]

Probing the latent hierarchical structure of data via diffusion models

Antonio Sclocchi, Alessandro Favero, Noam Itzhak Levi, and Matthieu Wyart. Probing the latent hierarchical structure of data via diffusion models. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=0GzqVqCKns. 11

2025
[34]

Simoncelli

Florentin Guth, Zahra Kadkhodaie, and Eero P. Simoncelli. Learning normalized image densities via dual score matching. InAdvances in Neural Information Processing Systems, 2025

2025
[35]

Hunt-Smith, W

N.T. Hunt-Smith, W. Melnitchouk, F. Ringer, N. Sato, A.W. Thomas, and M.J. White. Acceler- ating markov chain monte carlo sampling with diffusion models.Computer Physics Communi- cations, 296:109059, 2024. ISSN 0010-4655. doi: https://doi.org/10.1016/j.cpc.2023.109059. URL https://www.sciencedirect.com/science/article/pii/S0010465523004046

work page doi:10.1016/j.cpc.2023.109059 2024
[36]

Springer, January

Grzegorz Rozenberg and Arto Salomaa.Handbook of Formal Languages. Springer, January
[37]

doi: 10.1007/978-3-642-59126-6

work page doi:10.1007/978-3-642-59126-6
[38]

Oxford University Press, 2009

Marc Mezard and Andrea Montanari.Information, physics, and computation. Oxford University Press, 2009

2009
[39]

Taylor & Francis, London, 2 edition, 1994

Dietrich Stauffer and Ammon Aharony.Introduction to Percolation Theory. Taylor & Francis, London, 2 edition, 1994

1994
[40]

Marginal stability in structural, spin, and electron glasses

Markus Müller and Matthieu Wyart. Marginal stability in structural, spin, and electron glasses. Annu. Rev. Condens. Matter Phys., 6(1):177–200, 2015

2015
[41]

Dolma: an open corpus of three trillion tokens for language model pretraining research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Ri...

work page doi:10.18653/v1/2024.acl-long.840 2024
[42]

Llada-moe: A sparse moe diffusion language model.arXiv preprint arXiv:2509.24389, 2025

Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, and Ji-Rong Wen. Llada-moe: A sparse moe diffusion ...

work page arXiv 2025
[43]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Brains and algorithms partially converge in natural language processing.Communications Biology, 5(1):134, 2022

Charlotte Caucheteux and Jean-Rémi King. Brains and algorithms partially converge in natural language processing.Communications Biology, 5(1):134, 2022. doi: 10.1038/ s42003-022-03036-1

2022
[45]

Deep language algorithms predict semantic comprehension from brain activity.Scientific Reports, 12(1):16327, 2022

Charlotte Caucheteux, Alexandre Gramfort, and Jean-Rémi King. Deep language algorithms predict semantic comprehension from brain activity.Scientific Reports, 12(1):16327, 2022. doi: 10.1038/s41598-022-20460-9

work page doi:10.1038/s41598-022-20460-9 2022
[46]

Emergence of a high-dimensional abstraction phase in language transformers

Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Lei Yu, Alessandro Laio, and Marco Baroni. Emergence of a high-dimensional abstraction phase in language transformers. InThe Thirteenth International Conference on Learning Representations, 2025. URL https: //openreview.net/forum?id=0fD3iIBhlV

2025
[47] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009
[48] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[49] A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1198. URL https://aclanthology.org/P18-1198
[51] I. Tenney, D. Das, and E. Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL https://aclant...
[52] C. D. Manning, K. Clark, J. Hewitt, U. Khandelwal, and O. Levy. Emergent linguistic structure in artificial neural networks trained by self-supervision. Proceedings of the National Academy of Sciences, 117(48):30046–30054, 2020
[53] Artur Kulmizev and Joakim Nivre. Schrödinger’s tree—on syntax and neural language models. Frontiers in Artificial Intelligence, 5:796788, 2022
[54] Jack T. Parley, Francesco Cagnetta, and Matthieu Wyart. Deep networks learn to parse uniform-depth context-free languages from local statistics, 2026. URL https://arxiv.org/abs/2602.06065
[55] Zeyuan Allen-Zhu and Yuanzhi Li. Physics of language models: Part 1, learning hierarchical language structures. Transactions on Machine Learning Research, 2025. URL https://openreview.net/forum?id=mPQKyzkA1K
[56] Haoyu Zhao, Abhishek Panigrahi, Rong Ge, and Sanjeev Arora. Do transformers parse while predicting the masked word? In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16513–16542, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18...
[57] Jerome Garnier-Brun, Marc Mézard, Emanuele Moscato, and Luca Saglietti. How transformers learn structured data: Insights from hierarchical filtering. In International Conference on Machine Learning (ICML), 2025. arXiv:2408.15138
[58] E. DeGiuli. Random language model. Phys. Rev. Lett., 122:128301, Mar 2019. doi: 10.1103/PhysRevLett.122.128301. URL https://link.aps.org/doi/10.1103/PhysRevLett.122.128301
[59] Laura Ying Schulz, Daniel Mitropolsky, and Tomaso Poggio. Unraveling syntax: How language models learn context-free grammars, 2025. URL https://arxiv.org/abs/2510.02524. arXiv:2510.02524
[60] Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487, 2025

A Plateau heatmaps for different choices of n_max

[Figure: plateau heatmap with axes f and ρ at ρd ⋅ nmax = 104, marking fper and finv; regions labeled non-ergodic, ergodic (higher → slower), and ergodic (higher → faster); invers...]
