
arxiv: 2604.17673 · v1 · submitted 2026-04-20 · 💻 cs.LG

Grokking of Diffusion Models: Case Study on Modular Addition

Pith reviewed 2026-05-10 05:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords diffusion models · grokking · modular addition · mechanistic interpretability · flow matching · generalization · algorithmic learning

The pith

Diffusion models exhibit grokking on modular addition by composing periodic operand representations or separating arithmetic computation from visual denoising across sampling timesteps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion models trained with flow-matching objectives display grokking on modular addition, overfitting before generalizing in a delayed way. This setup allows precise examination of the models' internal steps for performing the discrete operation within continuous image generation. In single-image settings the models combine periodic patterns representing each input number. In settings with varied images the iterative generation process splits the work into an early arithmetic phase followed by later image cleanup past a key timestep threshold.

Core claim

Diffusion models with flow-matching objectives exhibit grokking on modular addition. Mechanistic analysis shows that in a single-image regime the models implement the operation by composing periodic representations of the individual operands. In a diverse-image regime the models exploit their iterative sampling to divide the task into an arithmetic computation phase followed by a visual denoising phase separated by a critical timestep.
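
To make "composing periodic representations of the individual operands" concrete, the following is a minimal sketch of the standard trigonometric construction for modular addition. It is not the model's learned circuit. The modulus P = 23 and the frequency indices 1, 3, 7, 9 are taken from the paper's hyperparameter table and figure captions; the definition w_k = 2πk/P, the cosine-similarity readout, and the function name are assumptions made for illustration.

```python
import numpy as np

P = 23                    # modulus reported in the paper's hyperparameter table
KEY_FREQS = (1, 3, 7, 9)  # frequency indices named in the figure captions

def modular_add_via_fourier(a, b, p=P, freqs=KEY_FREQS):
    """Toy readout: recover (a + b) mod p from per-operand sines and cosines."""
    c = np.arange(p)      # every candidate answer
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p  # assumed definition of w_k
        # Angle-addition identities combine the two operand representations.
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # cos(w (a + b - c)) equals 1 exactly when c == (a + b) mod p.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))

result = modular_add_via_fourier(17, 12)  # (17 + 12) mod 23
```

Because 23 is prime, each listed frequency alone already peaks only at the correct residue; summing several frequencies widens the margin between the correct answer and the rest.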

What carries the argument

The iterative sampling process under flow-matching training that partitions modular addition into distinct arithmetic and denoising phases or composes periodic representations of operands.

If this is right

  • Diffusion models can bridge continuous pixel-space generation with discrete symbolic reasoning through their sampling dynamics.
  • Grokking provides a controlled window into algorithmic learning inside generative models beyond transformers.
  • The separation of computation and denoising phases offers a natural decomposition for tasks mixing reasoning and generation.
  • Similar internal structures may appear in other arithmetic or logical operations learned by diffusion models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same phase-separation mechanism could be tested on other symbolic tasks such as modular multiplication to check if arithmetic always precedes denoising.
  • Interventions at the critical timestep might allow selective editing of the computed result without altering the final image style.
  • This decomposition suggests diffusion models could serve as hybrid systems for visual and logical problems by design rather than accident.

Load-bearing premise

The internal representations and phase separations uncovered by mechanistic dissection match the model's actual computation rather than arising from data encoding or visualization choices.

What would settle it

Ablate the periodic representations, or shift the identified critical timestep threshold, and measure accuracy on unseen modular addition cases. A sharp drop would support the decomposition; little or no change would falsify it.

Figures

Figures reproduced from arXiv: 2604.17673 by Jiatao Gu, Joon Hyeok Kim, Mattis Dalsætra Østby, Yong-Hyun Park.

Figure 1: Overview of the Analysis Pipeline. (Left) To mechanistically explain the observed grokking phenomenon, we demonstrate that the model generalizes the periodic algorithmic structure of modular addition. Using Fourier analysis, we identify dominant frequencies in the model activations and verify that they reconstruct trigonometric identities. (Middle) Extending our analysis to the multi-image regime, we inve…
Figure 2: Grokking dynamics under diverse visual complexities. (Left) Baseline (N = 1, S = 32): Classic grokking with a significant generalization lag post-training saturation. (Middle) Variant 1 (N = 4, S = 32): Increased visual diversity reduces the lag, showing synchronized convergence where validation accuracy spikes before training saturation, motivating the N = 256 scaling in Section 4.2. (Right) Variant 2 (N =…
Figure 3: Spectral Analysis via Fractional Variance Explained (FVE). FVE measures the proportion of total power attributed to each frequency’s Fourier coefficients. Bars represent spectral densities for frequencies wk, including one-degree (1D) and two-degree (2D) components; 2D features capture quadratic interactions such as cos(wk a) sin(wk b). (Left) In Attention Score A, mediating query (c) and key (a), four sel…
Figure 4: Grokking dynamics in N = 256. Discrete Concept Formation via FFN. However, the standard baseline architecture (SA–FFN) failed to trigger grokking in the diverse-image regime (N = 256). Notably, scaling up the model’s width was insufficient to overcome this failure, indicating that raw capacity alone does not guarantee algorithmic generalization. We hypothesize that for abstract reasoning tasks such as modula…
Figure 5: PCA visualization of the Pre-SA-FFN layer’s activations. Each class is color-coded, with class centroids marked by circles. (a) The input activations, which correspond to the embedding layer’s output, exhibit highly entangled representations, reflecting the high intra-class variance of the continuous input space. (b) Conversely, the output activations demonstrate clear, linearly separable clusters for each…
Figure 6: Causal Intervention on Internal Representations. (a) Correct image c perturbed with varying Gaussian noise levels (gray to green indicates high to low noise). (b) Entropy trajectories starting from an incorrect image c′ perturbed with varying noise levels. The trajectories are color-coded by the final ODE sampling accuracy, where green denotes successful rectification and red indicates the failure. The in…
Figure 7: Schematic of the Single-Layer Diffusion Transformer Architecture. The operands a, b and a Gaussian noise map are concatenated, patchified, and projected into the latent space. Within the self-attention block, the model computes components Av[a] and Av[b], which are then fused at the attention-FFN interface. This transition facilitates a quadratic mixture of operand features, enabling the emergence of the m…
Figure 8: Single-step sampling results
Figure 9: Embedding Activation of the operand a. However, we observe the distinct significance of four specific frequencies (w1, w3, w7, and w9) within the self-attention blocks, as shown below. For these layers, we further provide activations at the individual attention head level. Notably, our analysis reveals that specific heads specialize in capturing a single, isolated frequency. (a) FVE barchart (b) Heatmap of a…
Figure 10: Attention Value Activation of the operand a
Figure 11: Attention Key Activation of the operand a. (a) FVE barchart (b) Heatmap of activations in head level
Figure 12: Attention Score Activation of the operand a. (a) FVE barchart (b) Heatmap of activations in neuron and head level
Figure 13: Attention Av Activation of the operand a
Figure 14: Attention Out Activation of the operation result c. Finally, at the pre-GeLU activation stage of the FFN layer, the activation at position c is clearly composed of 2D FVEs corresponding to the selective frequencies shared throughout the preceding SA block. This characteristic motivated our focus on this specific layer, as it structurally represents the emergence of arithmetic generalization. (a) FVE barcha…
Figure 15: FFN Pre-GeLU Activation of the operation result c. At this layer, we derived the Fourier basis and recovered the underlying trigonometric identities, as it exhibited the most significant periodic structures characterized by 2D signals. The following table presents the complete set of recovered identities, demonstrating a clear correlation with the addition of angular values.
Figure 16: FFN Post-GeLU Activation of the operation result c. Finally, we present the output activation of the final MLP layer. This activation is unpatchified by the network and mapped back into the image modality. As illustrated, the distinct frequency signals, previously dominant in the internal layers, are no longer present, indicating that the representation has been fully transformed into the spatial domain for…
Figure 17: Final MLP Layer’s output activation on result position c. C MULTIPLE-IMAGE REGIME DETAIL. C.1 FFN SANDWICH ARCHITECTURE. To facilitate the effective mapping of high-dimensional visual inputs into an algorithmic space, we adopt an auxiliary FFN layer situated between the embedding layer and the Self-Attention (SA) block. Visualization via PCA demonstrates that this “Sandwich” architecture successfully project…
Figure 18: PCA at the operand a position at timestep = 0
Figure 19: PCA at the operand a position at timestep = 50. C.2 CONSTRUCTION OF THE EVALUATION DATASET. To rigorously evaluate the model’s algorithmic generalization beyond simple visual mapping, we constructed a large-scale, non-redundant dataset for the N = 256 regime. The construction process followed a strict protocol to ensure both arithmetic coverage and visual diversity: 1. Visual Diversity: For each label in th…
Figure 20: Timestep 0 with accuracy 0%
Figure 21: Timestep 21 with accuracy 7%
Figure 22: Timestep 32 with accuracy 54%
Figure 23: Timestep 50 with accuracy 96%
Figure 24: Timestep 0 with accuracy 0%
Figure 25: Timestep 21 with accuracy 0%
Figure 26: Timestep 32 with accuracy 54%
Figure 27: Timestep 50 with accuracy 96%. E ABLATION STUDY 1: VARIOUS P VALUES. To verify that the periodic structures observed in the modular addition operation can be generalized, we first provide an ablation study on various values of P. We demonstrate that the 1D and 2D periodicities revealed by the Fourier analysis are also observable in the P = 27, 31, and 35 cases. Although the EMNIST dataset provides both uppe…
Figure 28: P = 23
Figure 29: P = 27
Figure 30: P = 31
Figure 31: P = 35
Figure 32: P = 27. (a) Correct result image recovery (b) Incorrect result image rectification
Figure 33: P = 31. (a) Correct result image recovery (b) Incorrect result image rectification
Figure 34: P = 35. F ABLATION STUDY 2: KUZUSHIJI-EMNIST DATASET. We further argue that the emergence of periodic structures is not a dataset-specific phenomenon by providing an identical Fourier analysis on models trained on the Kuzushiji-MNIST dataset (Clanuwat et al., 2018). Because handwritten Japanese characters contain relatively more complex shapes than the English alphabet, this experiment strongly supports the…
Figure 35: P = 39 with Kuzushiji-MNIST dataset
Figure 36: P = 43 with Kuzushiji-MNIST dataset
Figure 37: P = 47 with Kuzushiji-MNIST dataset. Figures 38–40 demonstrate the mode shift between reasoning and denoising on the ODE sampling path on the model trained on the Kuzushiji-MNIST dataset.
Figure 38: P = 39 on Kuzushiji-MNIST dataset. (a) Correct result image recovery (b) Incorrect result image rectification
Figure 39: P = 43 on Kuzushiji-MNIST dataset. (a) Correct result image recovery (b) Incorrect result image rectification
Figure 40: P = 47 on Kuzushiji-MNIST dataset. G ABLATION STUDY 4: PHASE TRANSITION ACROSS VARIOUS P VALUES AND DATASETS. Remarkably, across various P values and heterogeneous datasets, a sudden increase in FVE entropy, reflecting the collapse of concentration on selective frequencies, consistently coincides with the timestep at which the final ODE sampling accuracy drops to near zero (< 0.5%). We argue that this criti…
Figure 41: Alignment of Predicted and Observed Timesteps for the Phase Transition. This 3D visualization illustrates the incorrect image rectification dynamics detailed in Section 4.2. The left vertical plane displays ODE final accuracies across varying initial noise levels for models trained on different P values and datasets (EMNIST and KMNIST). The marked points denote the 0.5% accuracy threshold, the critical tim…
Figure 42: Grokking phenomena demonstrated on a standard depth-2 architecture. Validation accuracy trajectories show that successful generalization (grokking) occurs in both (a) the N = 1 single-image regime and (b) the N = 256 diverse-image regime, confirming that the phenomenon is not restricted to the single-layer FFN-sandwich model.
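
The captions for Figures 3 and 40 describe Fractional Variance Explained (FVE) as the share of total power carried by each frequency, and use the entropy of that distribution as a concentration measure. The sketch below is a simplified one-dimensional version under stated assumptions: the activation is indexed by a single operand value, the modulus P is odd, and the base frequency is 2π/P. The paper's 2D components and exact normalization are not reproduced, and the signal is synthetic rather than model data.

```python
import numpy as np

P = 23  # modulus from the paper's hyperparameter table (odd, so no Nyquist bin)

def fve_per_frequency(activation):
    """FVE of each frequency k for a 1-D activation indexed by operand value."""
    x = np.asarray(activation, dtype=float)
    x = x - x.mean()                     # drop the constant (DC) term
    power = np.abs(np.fft.rfft(x)) ** 2  # power at k = 0 .. len(x) // 2
    power[0] = 0.0                       # guard against rounding residue at DC
    return power / power.sum()           # assumes the signal is not constant

def spectral_entropy(fve):
    """Shannon entropy in nats; low values mean few dominant frequencies."""
    p = fve[fve > 0]
    return float(-(p * np.log(p)).sum())

a = np.arange(P)
w = 2 * np.pi / P  # assumed base frequency, so that w_k = k * w
# Synthetic activation with energy at k = 3 and k = 7 only (not model data).
toy = 2.0 * np.cos(3 * w * a) + 1.0 * np.sin(7 * w * a)
fve = fve_per_frequency(toy)
entropy = spectral_entropy(fve)
```

With amplitudes 2 and 1, the two active frequencies carry power in a 4:1 ratio, so the FVE is about 0.8 at k = 3 and 0.2 at k = 7. A flat spectrum over all 11 nonzero frequencies would instead give the maximum entropy, ln 11.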
Abstract

Despite their empirical success, how diffusion models generalize remains poorly understood from a mechanistic perspective. We demonstrate that diffusion models trained with flow-matching objectives exhibit grokking (delayed generalization after overfitting) on modular addition, enabling controlled analysis of their internal computations. We study this phenomenon across two levels of data regime. In a single-image regime, mechanistic dissection reveals that the model implements modular addition by composing periodic representations of individual operands. In a diverse-image regime with high intraclass variability, we find that the model leverages its iterative sampling process to partition the task into an arithmetic computation phase followed by a visual denoising phase, separated by a critical timestep threshold. Our work provides the mechanistic decomposition of algorithmic learning in diffusion models, revealing how these models bridge continuous pixel-space generation and discrete symbolic reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that diffusion models trained with flow-matching objectives exhibit grokking on modular addition, enabling mechanistic analysis of their internal computations. In the single-image regime, the model implements the task by composing periodic representations of individual operands. In the diverse-image regime with high intraclass variability, the iterative sampling process partitions the task into an arithmetic computation phase followed by a visual denoising phase separated by a critical timestep threshold.

Significance. If the mechanistic interpretations are causally validated, the work offers a controlled decomposition of algorithmic learning inside continuous diffusion processes, showing how symbolic reasoning can emerge from pixel-space generation. The grokking lens for dissection is a useful methodological contribution that could generalize to other generative models.

major comments (2)
  1. [§4.1] §4.1 (single-image regime): The claim that modular addition is implemented via composition of periodic operand representations rests on observational dissection (likely Fourier analysis of activations). Without causal interventions such as targeted ablation or patching of the identified periodic components and measurement of the resulting drop in addition accuracy, it remains possible that these structures are encoding artifacts rather than functionally used computations.
  2. [§4.2] §4.2 (diverse-image regime): The reported separation of arithmetic computation before a critical timestep from subsequent visual denoising is identified via timestep sweeps or trajectory analysis. The manuscript must specify the quantitative criterion used to locate the threshold (e.g., accuracy curves or activation divergence) and include an intervention test (e.g., forcing early denoising or blocking arithmetic features at that timestep) to rule out visualization or encoding artifacts.
minor comments (2)
  1. The methods section should explicitly state the flow-matching loss formulation, network architecture details, and exact hyper-parameters used for both regimes to support reproducibility.
  2. Figure captions for grokking curves and activation visualizations would benefit from explicit scale bars and legends indicating what each color or marker represents.
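
On minor comment 1, the implementation excerpt captured in the reference graph states that the model predicts the clean image x_0 and that the velocity is v_θ(x_t, t) = (x_t − x_0)/t. A minimal sketch of Euler ODE sampling under that relation follows. The time convention (t = 1 is pure noise, t = 0 is data), the step count, and the stand-in predictor are assumptions for illustration, not the authors' code.

```python
import numpy as np

def velocity_from_x0(x_t, x0_pred, t):
    """Velocity implied by an x0 prediction: v = (x_t - x0) / t."""
    return (x_t - x0_pred) / t

def euler_sample(predict_x0, x_start, num_steps=50):
    """Integrate dx/dt = v from t = 1 (noise) down to t = 0 (data)."""
    x = x_start.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt  # current time, never zero inside the loop
        v = velocity_from_x0(x, predict_x0(x, t), t)
        x = x - dt * v    # one Euler step toward the data end
    return x

rng = np.random.default_rng(0)
target = rng.normal(size=(32, 32))  # stand-in for a 32x32 image
noise = rng.normal(size=(32, 32))
# Hypothetical oracle that always predicts the clean image; a trained
# network would take its place in practice.
sample = euler_sample(lambda x, t: target, noise)
```

With a perfect x_0 predictor the last step has dt/t = 1, so the sampler lands on the target. A real model's prediction changes along the path, which is what makes a split into phases across timesteps possible.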

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments correctly identify opportunities to strengthen the causal evidence for our mechanistic claims. We address each major point below and commit to the indicated revisions.

Point-by-point responses
  1. Referee: [§4.1] §4.1 (single-image regime): The claim that modular addition is implemented via composition of periodic operand representations rests on observational dissection (likely Fourier analysis of activations). Without causal interventions such as targeted ablation or patching of the identified periodic components and measurement of the resulting drop in addition accuracy, it remains possible that these structures are encoding artifacts rather than functionally used computations.

    Authors: We agree that the analysis in §4.1 is observational and would benefit from causal validation. Our dissection relies on Fourier analysis of activations together with their correlation to task performance across training. In the revised version we will add targeted ablation experiments that zero or perturb the identified periodic components and quantify the resulting drop in modular addition accuracy. This will directly test whether the structures are functionally used rather than artifacts. revision: yes

  2. Referee: [§4.2] §4.2 (diverse-image regime): The reported separation of arithmetic computation before a critical timestep from subsequent visual denoising is identified via timestep sweeps or trajectory analysis. The manuscript must specify the quantitative criterion used to locate the threshold (e.g., accuracy curves or activation divergence) and include an intervention test (e.g., forcing early denoising or blocking arithmetic features at that timestep) to rule out visualization or encoding artifacts.

    Authors: We will revise §4.2 to state explicitly the quantitative criterion used to identify the critical timestep (the point at which arithmetic accuracy plateaus while image-quality metrics begin to improve, measured via separate probes). We will also add intervention experiments that either block arithmetic features or force early denoising at the identified threshold and report the resulting performance changes. These additions will rule out visualization artifacts. revision: yes
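
One way to make a threshold criterion explicit is sketched below. The captions for Figures 40 and 41 mark the point where final ODE sampling accuracy falls below 0.5%; this helper returns the first sweep position at which that happens. The function name, the sweep ordering, and the accuracy values are all invented for illustration and are not results from the paper.

```python
import numpy as np

def critical_timestep(timesteps, accuracy, threshold=0.005):
    """First timestep in the sweep whose accuracy is below the threshold."""
    timesteps = np.asarray(timesteps)
    accuracy = np.asarray(accuracy, dtype=float)
    below = np.flatnonzero(accuracy < threshold)
    if below.size == 0:
        return None  # accuracy never falls below the threshold
    return int(timesteps[below[0]])

# Made-up sweep: starting step of the intervention against final accuracy.
start_steps = [0, 10, 20, 30, 40, 50]
final_acc = [0.97, 0.95, 0.80, 0.30, 0.002, 0.0]
t_crit = critical_timestep(start_steps, final_acc)
```

A criterion of this kind depends only on the accuracy curve, so it can be compared against an independently located entropy jump instead of being defined by it.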

Circularity Check

0 steps flagged

Empirical mechanistic study of grokking in diffusion models contains no circular derivations

full rationale

The paper reports training diffusion models with flow-matching on modular addition, observes grokking behavior, and performs post-hoc mechanistic dissection of representations and timestep phases. No first-principles derivation, fitted parameter renamed as prediction, or self-citation chain is present; all claims rest on reproducible empirical observations of training dynamics and internal activations rather than any quantity defined in terms of itself. The work is self-contained against external benchmarks of model behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, invented entities, or non-standard axioms beyond the standard assumption that flow-matching training produces the observed behavior.

axioms (1)
  • Domain assumption: Flow-matching objectives produce diffusion models capable of the described internal computations on modular addition.
    Invoked as the training method that enables the grokking and mechanistic findings.



Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 13 internal anchors

  1. [1]

    Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training. arXiv:2505.17638 [cs]

    Tony Bonnaire, Raphaël Urfin, Giulio Biroli, and Marc Mézard. Why diffusion models don’t memorize: The role of implicit dynamical regularization in training. arXiv preprint arXiv:2505.17638,

  2. [2]

    On the edge of memorization in diffusion models. arXiv preprint arXiv:2508.17689, 2025

    Sam Buchanan, Druv Pai, Yi Ma, and Valentin De Bortoli. On the edge of memorization in diffusion models. arXiv preprint arXiv:2508.17689,

  3. [3]

    doi:10.23915/distill.00024

    doi: 10.23915/distill.00024. https://distill.pub/2020/circuits. Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical japanese literature. CoRR, abs/1812.01718,

  4. [4]

    Deep Learning for Classical Japanese Literature

    URL http://arxiv.org/abs/1812.01718. Gregory Cohen, Saeed Afshar, Jonathan Tapson, and André van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373,

  5. [5]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600,

  6. [6]

    arXiv preprint arXiv:2303.06173

    Xander Davies, Lauro Langosco, and David Krueger. Unifying grokking and double descent. arXiv preprint arXiv:2303.06173,

  7. [7]

    URL https://arxiv.org/abs/2405.19201. Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kap...

  8. [8]

    Published as a paper at the 2nd DeLTa Workshop, ICLR 2026. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

    https://transformer-circuits.pub/2021/framework/index.html. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition,

  9. [9]

    Deep Residual Learning for Image Recognition

    URL https://arxiv.org/abs/1512.03385. Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus),

  10. [10]

    Gaussian Error Linear Units (GELUs)

    URL https://arxiv.org/abs/1606.08415. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851,

  11. [11]

    Generalization in diffusion models arises from geometry-adaptive harmonic representations

    Zahra Kadkhodaie, Florentin Guth, Eero P Simoncelli, and Stéphane Mallat. Generalization in diffusion models arises from geometry-adaptive harmonic representations. arXiv preprint arXiv:2310.02557,

  12. [12]

    Diffusion models already have a semantic latent space. arXiv preprint arXiv:2210.10960,

    URL https://arxiv.org/abs/2210.10960. Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720,

  13. [13]

    Back to Basics: Let Denoising Generative Models Denoise

    URL https://arxiv.org/abs/2511.13720. Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747,

  14. [14]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022a. Ziming Liu, Eric J Michaud, and Max Tegmark. Omnigrok: Grokking beyond algorithmic data. arXiv preprint arXiv:2210.01117, 2022b. Rui Lu, Runzhe Wang, Kaifeng Lyu, Xitai Jiang, Gao Huang, and...

  15. [15]

    Progress measures for grokking via mechanistic interpretability

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217,

  16. [16]

    URL https://arxiv.org/abs/2310.09336. Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack ...

  17. [17]

    Core Francisco Park, Maya Okawa, Andrew Lee, Hidenori Tanaka, and Ekdeep Singh Lubana

    https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Core Francisco Park, Maya Okawa, Andrew Lee, Hidenori Tanaka, and Ekdeep Singh Lubana. Emergence of hidden capabilities: Exploring learning dynamics in concept space,

  18. [18]

    Emergence of

    URL https://arxiv.org/abs/2406.19370. Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space of diffusion models through the lens of riemannian geometry,

  19. [19]

    William Peebles and Saining Xie

    URL https://arxiv.org/abs/2307.12868. William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205,

  20. [20]

    Memorization to generalization: Emergence of diffusion models from associative memory. arXiv preprint arXiv:2505.21777, 2025

    Bao Pham, Gabriel Raya, Matteo Negri, Mohammed J Zaki, Luca Ambrogioni, and Dmitry Krotov. Memorization to generalization: Emergence of diffusion models from associative memory. arXiv preprint arXiv:2505.21777,

  21. [21]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177,

  22. [22]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3,

  23. [23]

    GLU Variants Improve Transformer

    URL https://arxiv.org/abs/2002.05202. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR,

  24. [24]

    Selective underfitting in diffusion models, 2025

    Kiwhan Song, Jaeyeon Kim, Sitan Chen, Yilun Du, Sham Kakade, and Vincent Sitzmann. Selective underfitting in diffusion models. arXiv preprint arXiv:2510.01378,

  25. [25]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,

  26. [26]

    One-step is enough: Sparse autoencoders for text-to-image diffusion models. arXiv preprint arXiv:2410.22366, 2024

    URL https://arxiv.org/abs/2410.22366. Zhihua Tian, Sirun Nan, Ming Xu, Shengfang Zhai, Wenjie Qu, Jian Liu, Ruoxi Jia, and Jiaheng Zhang. Sparse autoencoder as a zero-shot classifier for concept erasing in text-to-image diffusion models,

  27. [27]

    Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar

    URL https://arxiv.org/abs/2503.09446. Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390,

  28. [28]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

  29. [29]

    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small. arXiv preprint arXiv:2211.00593,

  30. [30]

    Video models are zero-shot learners and reasoners

    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328,

  31. [31]

    URL https://arxiv.org/abs/2307.05596. Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report. arXiv preprint arXiv:2511.18870,

  32. [32]

    arXiv preprint arXiv:2303.13336

    Chenshuang Zhang, Chaoning Zhang, Sheng Zheng, Mengchun Zhang, Maryam Qamar, Sung-Ho Bae, and In So Kweon. A survey on audio diffusion models: Text to speech synthesis and enhancement in generative ai. arXiv preprint arXiv:2303.13336,

  33. [33]

    gddim: Generalized denoising diffusion implicit models

    Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564,

  34. [34]

    Following the x_0-parameterization adopted in this work, the model directly predicts the clean image x_0, which is related to the velocity by v_θ(x_t, t) = (x_t − x_0)/t

    A IMPLEMENTATION DETAILS. We train our model using the Rectified Flow (RF) framework (Liu et al., 2022a) to predict the velocity vector v_θ(x_t, t). Following the x_0-parameterization adopted in this work, the model directly predicts the clean image x_0, which is related to the velocity by v_θ(x_t, t) =...

  35. [35]

    Feedforward Network. The FFN comprises 512 hidden neurons with GeLU activation (Hendrycks & Gimpel, 2023), maintaining a 1× expansion ratio to balance capacity and simplicity

    to provide spatial context for the patchified tokens. Feedforward Network. The FFN comprises 512 hidden neurons with GeLU activation (Hendrycks & Gimpel, 2023), maintaining a 1× expansion ratio to balance capacity and simplicity. While we explored more complex variants such as SwiGLU (Shazeer,

  36. [36]

    Table 2: Model and dataset hyperparameters

    for improved reconstruction, we found that the standard GeLU activation offered superior clarity for mechanistic interpretability, specifically in tracking the entropy transitions of pre-activation states. Table 2: Model and dataset hyperparameters. Parameter: Value. Dataset: Modulus (P): 23; Images per symbol (N): 1, 4, 256; Training ratio (R): 0.9; Image resolut...

  37. [37]

    through dominant 2D Fourier components across selective frequencies (w1, w3, w7, w9) along with non-significant frequencies.
    Columns: W′L; u_k^⊤ FFNpreact(a, b) and v_k^⊤ FFNpreact(a, b); FVE
    cos(w1(a+b)): 138910 cos(w1 a) cos(w1 b) − 139849 sin(w1 a) sin(w1 b); FVE 0.95
    sin(w1(a+b)): 137133 cos(w1 a) sin(w1 b) + 136206 sin(w1 a) cos(w1 b); FVE 0.94
    cos(w2(a+b)): 939 cos(w2 a) cos(w2 b) − 426 s...

  38. [38]

    As in the baseline P = 23 case, we observe similar trajectories for both the recovery of correct answers and the rectification of initially incorrect images

    Figure 29: P = 27. Figure 30: P = 31. Figure 31: P = 35. Figures 32–34 demonstrate the mode shift during ODE sampling with various P values. As in the baseline P = 23 case, we observe similar trajectories for both the recovery of correct answers and the rectification of initially incorrect images. (a) Correct...