Pith · machine review for the scientific record

arxiv: 2605.10019 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI · cs.CC · stat.ML

Recognition: no theorem link

The two clocks and the innovation window: When and how generative models learn rules

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:05 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CC · stat.ML
keywords generative models · rule learning · memorization · training dynamics · innovation window · diffusion models · autoregressive models · synthetic tasks

The pith

Generative models learn rules during an innovation window between first valid outputs and memorization of training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two distinct training timescales in generative models on rule-based synthetic tasks. τ_rule marks the point where generated samples first satisfy the underlying rule, while τ_mem marks when the model begins reproducing exact training examples. The interval between them, termed the innovation window, widens as dataset size N increases and narrows or disappears as rule complexity rises. The same pattern appears in both diffusion models like DiT and autoregressive models like GPT, with architecture-dependent timing offsets. Dissecting the learned score function shows rule-valid basins expanding around τ_rule and training-sample basins dominating around τ_mem.

Core claim

We define the innovation window as the interval [τ_rule, τ_mem]. This window widens with increasing N and narrows with rule complexity, and may vanish entirely when τ_rule ≥ τ_mem. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around τ_rule, while training samples' basins begin to dominate around τ_mem.

What carries the argument

The innovation window [τ_rule, τ_mem], where τ_rule is the training step of first rule-valid generations and τ_mem is the step when the model begins reproducing training samples.
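A minimal sketch of how the two clocks could be measured on group parity. `generate(step, n)` is a hypothetical sampling hook for the checkpoint at a given step, and the onset thresholds (rule accuracy ≥ 0.9, memorization ratio ≥ 0.1) follow the window shading in Figure 4 rather than any exact criterion quoted here:

    # Sketch: measuring tau_rule and tau_mem on group parity.
    # `generate(step, n)` (hypothetical) returns n generated D-dimensional
    # 0/1 arrays from the model checkpoint at training step `step`.
    import numpy as np

    D, G = 36, 6  # image dimension and group size, as in Figure 1

    def rule_valid(x):
        # Valid iff every group of G bits has even parity.
        return all(x[g:g + G].sum() % 2 == 0 for g in range(0, D, G))

    def memorized(x, train_set):
        # Memorized = exact bitwise match with some training sample.
        return any(np.array_equal(x, t) for t in train_set)

    def two_clocks(steps, generate, train_set, acc_thr=0.9, mem_thr=0.1):
        tau_rule = tau_mem = None
        for step in steps:
            samples = generate(step, 2000)
            acc = np.mean([rule_valid(x) for x in samples])
            mem = np.mean([memorized(x, train_set) for x in samples])
            if tau_rule is None and acc >= acc_thr:
                tau_rule = step  # first rule-valid generations
            if tau_mem is None and mem >= mem_thr:
                tau_mem = step   # onset of training-sample reproduction
        return tau_rule, tau_mem  # innovation window: [tau_rule, tau_mem]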

If this is right

  • The innovation window widens with larger dataset size N (see the scaling-fit sketch after this list).
  • The window narrows or vanishes entirely as rule complexity increases.
  • The two-clock structure and window appear in both diffusion and autoregressive architectures, with architecture-specific timing offsets.
  • Rule-valid sample basins in the score function expand around τ_rule while training-sample basins dominate around τ_mem.
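The first bullet's N-dependence is quantitative in the paper: Figure 10 fits τ_mem ∝ N^α with a near-linear exponent α ≈ 1–1.1. A minimal log-log fit sketch; the arrays below are illustrative stand-ins, not the paper's measurements:

    # Sketch: power-law fit of memorization onset vs. dataset size N
    # (cf. Figure 10). Data points are hypothetical.
    import numpy as np

    Ns = np.array([512, 1024, 2048, 4096, 8192])
    tau_mems = np.array([3.1e4, 6.0e4, 1.3e5, 2.5e5, 5.2e5])  # hypothetical

    alpha, log_c = np.polyfit(np.log(Ns), np.log(tau_mems), 1)
    print(f"tau_mem ~ {np.exp(log_c):.1f} * N^{alpha:.2f}")
    # The paper reports alpha ≈ 1–1.1 across architectures and scales.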

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training schedules or early stopping around the innovation window could promote rule learning over memorization (see the stopping-rule sketch after this list).
  • The framework may extend to predict generalization behavior on tasks beyond the binary rules and puzzles studied.
  • Varying model capacity could shift τ_rule earlier relative to τ_mem, enlarging the window on fixed data.
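A minimal sketch of the stopping rule suggested in the first bullet, assuming a hypothetical `eval_model(step)` callback that returns (rule_accuracy, memorization_ratio) for a checkpoint; the thresholds mirror the shaded window in Figure 4. This is an editorial illustration, not a procedure from the paper:

    # Sketch: stop training inside the innovation window, i.e., once
    # generations are rule-valid but not yet memorized.
    def stop_in_window(eval_model, step, acc_thr=0.9, mem_thr=0.1):
        acc, mem = eval_model(step)  # hypothetical evaluation callback
        return acc > acc_thr and mem < mem_thr  # rule learned, not memorized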

Load-bearing premise

The chosen definitions of τ_rule and τ_mem separate genuine rule learning from partial memorization or other artifacts on the synthetic tasks.
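One way to pressure-test this premise, echoing the paper's Figure 12, is to measure how far rule-valid generations sit from the training set: exact matches (distance 0) are memorized, while genuinely novel valid samples sit at positive Hamming distance. A sketch, assuming 0/1 arrays of shape (M, D) for generations and (N, D) for training data:

    # Sketch: minimum Hamming distance from each generation to the
    # training set (cf. Figure 12).
    import numpy as np

    def min_hamming(samples, train):
        out = np.empty(len(samples), dtype=int)
        for i, x in enumerate(samples):
            out[i] = (train != x).sum(axis=1).min()
        return out
    # A healthy innovation window shows rule-valid samples at distance > 0.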

What would settle it

Observe whether τ_rule exceeds τ_mem on a high-complexity rule with small N, resulting in no rule-valid generations before memorization begins.
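Concretely, reusing the `two_clocks` sketch above (an editorial construction, not the paper's code), the decisive configuration is a high-G, small-N run in which memorization onset arrives before any rule-valid generation:

    # Hypothetical decisive run: high rule complexity, small dataset.
    tau_rule, tau_mem = two_clocks(steps, generate, small_train_set)
    vanished = tau_rule is None or (tau_mem is not None and tau_rule >= tau_mem)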

Figures

Figures reproduced from arXiv: 2605.10019 by Bingbin Liu, Binxu Wang, Emma Lucia Byrnes Finn.

Figure 1. Rule learning and memorization for group-parity. A. Structure of the Group Parity dataset and evaluation setup. Each D-dim binary image is divided into groups of size G (here D = 36, G = 6), with each group satisfying even parity. We train DiT models on datasets of N = 4096 samples and evaluate model generations in terms of accuracy and memorization ratio, at both (B) sample and (C) group level, across diff… view at source ↗
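For concreteness, a minimal generator for the dataset this caption describes; the parity-bit construction below is the obvious one and not necessarily the paper's sampling procedure:

    # Sketch: Group Parity dataset (D = 36 bits, groups of G = 6,
    # even parity within each group).
    import numpy as np

    def sample_group_parity(n, D=36, G=6, seed=0):
        rng = np.random.default_rng(seed)
        x = rng.integers(0, 2, size=(n, D))
        for g in range(0, D, G):
            # Set the last bit of each group to the XOR of the others,
            # forcing even parity.
            x[:, g + G - 1] = x[:, g:g + G - 1].sum(axis=1) % 2
        return x

    train_set = sample_group_parity(4096)  # N = 4096 as in Figure 1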
Figure 2. Two clocks τ_rule and τ_mem scale differently with rule, data, and model. A. Schematic: rule complexity G delays τ_rule, dataset size N scales τ_mem near-linearly, model size (L, H) modulates both. The interval [τ_rule, τ_mem] is the innovation window of valid novel generation. B. Single-run learning dynamics (DiT-mini, N=4096, G=3): rule accuracy (blue), memorization ratio (orange dashed), and NaN/quantization … view at source ↗
Figure 3. Dissecting rule learning and memorization dynamics via multiple perspectives. A. Per-sample training-dynamics state raster for 2000 fixed initial noises. Samples are classified into four states across training steps: invalid (quantization-ambiguous), invalid (rule-violating), valid & novel, and valid & memorized, shown as (Top) counts or (Bottom) per-seed evolution throughout training. The transition matri… view at source ↗
Figure 4. Diffusion (DiT) and autoregressive (GPT) transformers share the two-clock structure with different timescales. A. Sample-level rule accuracy (solid) and memorization ratio (dashed) vs. training step for DiT-mini (green) and GPT-mini (purple) at G=3, N=4096. Shaded: innovation window where rule accuracy > 0.9 and memorization < 0.1. B. Per-architecture heatmaps across G ∈ {2, 3, 4, 6, 9, 12, 18, 36} at N=40… view at source ↗
Figure 5. Training-sample montage for binary rule families. Each panel shows a 5×5 grid of representative 6×6 training samples for one rule variant. Rows correspond to group parity, exact-K, row-K, row-variable-K, and global-K datasets. Columns vary the group size G, exact count K, or allowed count set K. The montage illustrates how the same binary image space supports qualitatively different constraints: parity imp… view at source ↗
Figure 6. Training-sample montage for structured categorical rule families. Each row shows representative training samples from a different structured rule family, with three independently drawn samples shown per family. Rows correspond to RowOnly Latin Square (n = 6), Latin Square (n = 5), Latin Square (n = 6), and Sudoku (6×6 grid with 2×3 blocks). Colors denote categorical values: five colors for n = 5 and six co… view at source ↗
Figure 7. Memorization and creativity in parity learning across dataset scales. Memorization ratio overlaid on accuracy, for different group sizes and training dataset scales, at group level (A.) and sample level (B.), in a similar format to Fig. 1B,C. Red solid line shows the memorization ratio of the ground-truth distribution (P + G for groups, and (P + G)^{D/G} for samples); and blue solid line shows the memorization… view at source ↗
Figure 8. Learning dynamics of rule acquisition and memorization across dataset size, G = 2, 3, 4, 6. Left: dynamics of sample parity accuracy across dataset scale, DiT-mini. Mid., Right: dynamics of sample memorization ratio across dataset scales, plotted as a function of step (Mid.) and step per sample (step × batch size / dataset size) (Right.). Colored dashed lines denote the memorization ratio e… view at source ↗
Figure 9. Learning dynamics of rule acquisition and memorization across dataset size, G = 9, 12, 18, 36. Similar format as Fig. 8. view at source ↗
Figure 10. Memorization onset time τ_mem as a function of training set size N, for all DiT and GPT model scales and group sizes G. Each point is one experiment; dashed lines show power-law fits τ_mem ∝ N^α per model and G. The near-linear exponent α ≈ 1–1.1 is consistent across architectures and scales. view at source ↗
Figure 11. Per-sample state transitions reveal staged diffusion learning. Top: raster plot tracking the state of each generated sample across training, sorted by the step at which it first becomes memorized, same panel as Fig. 3A. Middle: transition count matrices between sample states in three representative training windows: early quantization transition (steps 30–300), rule-learning transition (steps 10k–30k), an… view at source ↗
Figure 12. Valid generations move progressively closer to the training set before memorization. Mean Hamming distance from generated samples to the closest training-set sample over late training, shown separately for each parity group size G. Black curves show the average over all generated samples; blue curves show the average restricted to rule-valid samples; red curves show rule-valid but non-memorized samples. D… view at source ↗
Figure 13. Noise-scale spectra at representative training phases. DSM loss as a function of σ is shown at four checkpoints: early non-quantized generation, quantized but rule-violating generation, rule-learned generation, and memorization onset (same as Fig. 3D). Train, held-out valid, and random Boolean-cube samples are nearly indistinguishable before rule learning. After rule learning, the random Boolean-cube … view at source ↗
Figure 14. DSM loss dynamics across noise-scale bands. For DiT-mini trained on parity with G=3, N=4096 (run 2), we track DSM losses for training samples, held-out rule-valid samples, and random Boolean-cube samples, separately across logarithmic noise-scale bands. The top panel shows the corresponding sample-state counts across training (Fig. 3A). Low-noise losses quickly collapse across splits as the model learns t… view at source ↗
Figure 15. Onset of the train–test DSM loss gap, zooming in on the critical noise scales [0.2, 2.0]. Left: heatmap of the DSM loss gap between held-out rule-valid samples and training samples as a function of training step and noise scale σ, focusing on the critical noise scale. Right: the same gap plotted over training for representative fixed σ values. We note that the train–test gap first opens for noise scales around σ … view at source ↗
Figure 16. Attractor-basin profiles across training phases. For a DiT-mini model trained on group parity (G=3, N=4096), we probe 1D line segments from a training sample x_a toward three endpoints: an invalid Hamming-1 neighbor (left), a valid-novel Hamming-2 neighbor (middle), and the nearest other training sample (right). Rows show exact-bit match to x_a, Hamming distance from x_a, and denoiser ℓ2 distance ∥D(x, σ) − … view at source ↗
Figure 17. Early vector-field landscape evolution. Score magnitude (top row of each checkpoint) and signed denoising displacement along v_ab (bottom row) are shown for σ ∈ {0.20, 0.50, 1.00, 2.00}. Before rule learning, the landscape first forms attractors around Boolean-cube vertices; after rule learning begins, the valid anchors x_a and x_b start to dominate the rule-violating anchors x_c and x_d. view at source ↗
Figure 18. Mid-training vector-field landscape evolution. The same 2D slice and noise scales are tracked through the generalizing phase. The basin boundary between the training sample x_a and valid-novel neighbor x_b remains visible while the attraction toward x_a progressively strengthens, especially at intermediate noise scales. view at source ↗
Figure 19. Late vector-field landscape evolution. During memorization onset and late memorization, the attraction basin of the training sample x_a expands toward x_b and eventually dominates much of the plane. The effect is strongest around the same intermediate noise scales where the DSM train–test gap opens. view at source ↗
Figure 20. Shows the GPT analogue of … view at source ↗
Figure 21. Rule-learning and memorization timescales beyond parity. DiT-mini models are trained on 6×6 binary samples with N=4096. A. Rule onset (circles; sample rule accuracy ≥ 90%) and memorization onset (squares; memorization ratio above baseline) across parity, exact-K, row-K, row-variable-K, and global-K tasks. B–E. Onset times across different K values or K lists for exact-K, row-K, global-K, and row-variable-… view at source ↗
Figure 22. Learning dynamics for structured categorical rule families: RowOnly (permutation), Latin Square, Sudoku. Training trajectories of DiT-mini on three n = 6 structured categorical datasets: RowOnly (row permutation), Latin Square, and 6×6 Sudoku with 2×3 blocks. Columns show train loss, full-sample valid ratio, row-valid ratio, column-valid ratio, and sample memorization ratio over training. The top row show… view at source ↗
Figure 23. Onset times for rule learning, novel valid generation, and memorization in structured categorical tasks. We compare DiT-mini training dynamics across RowOnly, Latin Square, and Sudoku rule families. Left: scalar encoding. Right: one-hot encoding. For each task, we measure the first training step at which the model reaches rule accuracy ≥ 85% (blue circles), novel-valid fraction ≥ 0.5 (orange squares), and… view at source ↗
Figure 24. Learning-rate sweep for DiT-mini on group parity with G = 6. We train DiT-mini models with learning rates ranging from 10^-5 to 3×10^-3 and track rule accuracy, memorization ratio, rule-learning time τ_rule, and memorization time τ_mem. Moderate learning rates enable rule learning, with lr = 10^-4 producing the earliest reliable rule-learning transition, followed by lr = 3×10^-5 and 10^-5. Larger learning rates… view at source ↗
Figure 25. Weight-decay sweep for DiT-mini on group parity with G = 6. We train DiT-mini models across a range of weight-decay values while keeping the learning rate fixed. Rule learning occurs robustly across small to moderate weight-decay values, with τ_rule remaining on the order of 10^5 steps. Very large weight decay delays and weakens rule acquisition, as seen for wd = 10^-2, which does not reach the rule-learnin… view at source ↗
Figure 26. Learning-rate sweep for GPT-mini on group parity with G = 6. We train GPT-mini models with learning rates from 10^-5 to 3×10^-3 and measure rule accuracy, memorization ratio, τ_rule, and τ_mem. Rule learning is strongly learning-rate dependent: intermediate learning rates, especially lr = 10^-4 and 3×10^-4, reach high rule accuracy rapidly, whereas larger learning rates fail to learn the rule. Memorization fol… view at source ↗
Figure 27. Weight-decay sweep for GPT-mini on group parity with G = 6. We train GPT-mini models across weight-decay values from 0 to 1 and track rule accuracy, memorization ratio, τ_rule, and τ_mem. Across most weight-decay values, GPT-mini rapidly reaches high rule accuracy, indicating that rule learning is robust to moderate regularization. However, memorization time is strongly affected by large weight decay: incre… view at source ↗
Figure 28. Cross-run consistency of GPT and DiT near the parity learnability frontier. Evaluation trajectories across independent runs for matched GPT-mini and DiT-mini configurations, varying dataset size N and group size G. Different shades of the same hue denote different seeds; thin faded curves show raw trajectories and thick curves show EMA-smoothed trajectories. Blue denotes rule accuracy, orange dashed denot… view at source ↗
Original abstract

Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $\tau_{\mathrm{rule}}$, the step at which generations first become rule-valid, and $\tau_{\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $\tau_{\mathrm{rule}}$ and $\tau_{\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $\tau_{\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $\tau_{\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the \emph{innovation window} as the interval $[\tau_{\mathrm{rule}}, \tau_{\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $\tau_{\mathrm{rule}} \geq \tau_{\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $\tau_{\mathrm{rule}}$, while training samples' basins begin to dominate around $\tau_{\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.
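The "fundamental tension" in the opening sentence has a standard closed form: the exact minimizer of the denoising score-matching loss on a finite training set is a softmax-weighted average of training points, so perfect optimization implies memorization, and generalization must come from not reaching that minimizer. An editorial sketch of that optimal empirical denoiser (a known result, not code from the paper):

    # Sketch: the optimal denoiser for the *empirical* training distribution
    # under Gaussian noise. D*(x, σ) = Σ_i w_i x_i with
    # w_i ∝ exp(-||x - x_i||² / 2σ²), which can only reproduce training data.
    import numpy as np

    def optimal_empirical_denoiser(x, train, sigma):
        d2 = ((train - x) ** 2).sum(axis=1)
        w = np.exp(-(d2 - d2.min()) / (2 * sigma ** 2))  # shifted for stability
        return (w / w.sum()) @ train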

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that generative models on synthetic rule-based tasks exhibit two distinct training timescales: τ_rule (the step at which generated samples first satisfy the underlying rule, e.g., parity) and τ_mem (the step at which models begin reproducing exact training samples). It defines the 'innovation window' as the interval [τ_rule, τ_mem], shows that this window widens with dataset size N and narrows with rule complexity (potentially vanishing when τ_rule ≥ τ_mem), demonstrates the same structure in both DiT diffusion and GPT autoregressive models with architecture-dependent offsets, and supports the claim via analysis of how rule-valid sample basins expand in the DiT score landscape around τ_rule while training-sample basins dominate around τ_mem.

Significance. If the separation between the clocks holds under rigorous controls, the work offers a unified, predictive account of the transition from rule acquisition to memorization in generative models, with potential implications for training regimes that extend the innovation window. Strengths include the use of controlled synthetic tasks, cross-architecture consistency, and the landscape dissection in DiT; however, the absence of reported statistical details (runs, error bars, scoring protocols) currently limits the strength of the empirical claims.

major comments (2)
  1. [Abstract and experimental sections] The central claim that τ_rule marks genuine rule learning (distinct from partial memorization or sampling artifacts) rests on the operational definitions of τ_rule and τ_mem, but the abstract and experimental description provide no details on statistical controls, number of runs, error bars, or explicit ablations (e.g., diversity of rule-consistent but non-training samples or controls for local n-gram statistics on parity tasks). This makes it impossible to verify that the reported dependencies on N and complexity are not confounded by chance or partial pattern matching, directly undermining the interpretation of the innovation window.
  2. [Abstract] The claim that τ_mem is 'approximately invariant to the rule' and scales 'nearly linearly with N' while τ_rule depends on complexity is load-bearing for the window's predictive power, yet no equations or fitting procedures are shown that reduce these to parameter-free quantities; without such reduction or explicit controls for how rule-validity is scored, the reported architecture-dependent offsets remain descriptive rather than explanatory.
minor comments (2)
  1. Clarify the precise criterion used to declare a generation 'rule-valid' (e.g., exact satisfaction of parity or combinatorial constraints) and how τ_rule is detected in practice (first occurrence, majority vote over samples, etc.).
  2. The manuscript would benefit from a table or figure summarizing the measured τ values across rules, N, and capacities, including variability across runs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's potential and for the detailed comments. We address each major comment below, agreeing to enhance the statistical reporting and quantitative analysis in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract and experimental sections] The central claim that τ_rule marks genuine rule learning (distinct from partial memorization or sampling artifacts) rests on the operational definitions of τ_rule and τ_mem, but the abstract and experimental description provide no details on statistical controls, number of runs, error bars, or explicit ablations (e.g., diversity of rule-consistent but non-training samples or controls for local n-gram statistics on parity tasks). This makes it impossible to verify that the reported dependencies on N and complexity are not confounded by chance or partial pattern matching, directly undermining the interpretation of the innovation window.

    Authors: We acknowledge the referee's concern regarding the lack of reported statistical details. In the revised version, we will expand the experimental sections to include the number of independent runs performed, error bars, and explicit descriptions of the scoring protocols for rule validity. We will also incorporate ablations on the diversity of generated rule-consistent samples not in the training data and controls for local statistics on parity tasks to rule out partial pattern matching. These changes will strengthen the evidence for genuine rule learning at τ_rule. revision: yes

  2. Referee: [Abstract] The claim that τ_mem is 'approximately invariant to the rule' and scales 'nearly linearly with N' while τ_rule depends on complexity is load-bearing for the window's predictive power, yet no equations or fitting procedures are shown that reduce these to parameter-free quantities; without such reduction or explicit controls for how rule-validity is scored, the reported architecture-dependent offsets remain descriptive rather than explanatory.

    Authors: We recognize that the scaling claims would be more robust with explicit quantitative fits. The observations of τ_mem's near-invariance to the rule and linear scaling with N, as well as τ_rule's dependence on complexity, are drawn from systematic experiments across various configurations. In the revision, we will add equations and fitting procedures (e.g., linear regression models for τ_mem as a function of N) with goodness-of-fit metrics, and provide the precise definition of the rule-validity scoring function used. This will better explain the architecture-dependent offsets by grounding them in the optimization dynamics. revision: yes

Circularity Check

0 steps flagged

No significant circularity: operational definitions of the clocks are independent measurements yielding empirical dependencies

full rationale

The paper defines τ_rule as the training step when generations first satisfy the rule (measured by direct validity checks on samples) and τ_mem as the step when training samples begin to be reproduced (measured by exact matching). The innovation window is then defined simply as the interval between these two observed quantities. All reported scalings—τ_rule increasing with rule complexity and decreasing with capacity, τ_mem linear in N and rule-invariant—are presented as results of controlled experiments that vary N, complexity, and architecture while recording the measured clocks. No equations reduce these quantities to fitted parameters or presuppose the window structure; the landscape analysis in DiT models is likewise an observational dissection of score evolution around the measured times. No self-citations or imported uniqueness theorems are invoked as load-bearing premises. The derivation chain is therefore self-contained empirical observation rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations from synthetic rule tasks; no free parameters are mentioned. The key domain assumption is that these tasks capture the essential tension between rule learning and memorization in generative models.

axioms (1)
  • domain assumption Synthetic rule-valid tasks (parity and combinatorial puzzles) are representative of the rule-learning dynamics that occur when generative models are trained on real data.
    The paper uses these tasks to trace τ_rule and τ_mem across training.

pith-pipeline@v0.9.0 · 5623 in / 1364 out tokens · 49614 ms · 2026-05-12T03:05:23.416598+00:00 · methodology

