
arxiv: 2605.15787 · v1 · pith:6SAZDN77new · submitted 2026-05-15 · 💻 cs.LG · cs.AI

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

Pith reviewed 2026-05-20 21:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords grokking · transformers · attention · generalization · bayesian inference · lottery tickets · structural learning · delayed generalization

The pith

Transformers generalize only after attention performs Bayesian inference over the full task dependency graph, separate from MLP memorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the long delay before a transformer generalizes, called grokking, occurs because attention and the feed-forward layers solve two different problems that become decoupled early in training. Attention must learn to place enough probability mass on every token that carries task-relevant information, which the authors treat as inferring a hidden dependency graph in a Bayesian way. Once the MLP drives loss near zero by memorizing examples without this structure, attention receives almost no further gradient signal, so weight decay has to first undo the memorization before the missing dependencies can be discovered. This structural waiting time produces the observed inverse dependence on weight decay strength. The account also predicts that an explicit KL penalty pushing attention toward the right structure can shorten the delay according to an inverse scaling law.
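A KL penalty of the kind described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the direction of the divergence (target against attention), and the toy target distribution are all assumptions introduced here. The point it shows is that the penalty's gradient on the attention logits does not depend on the task loss, so it survives even when cross-entropy is already near zero.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(target, attn, eps=1e-12):
    """KL(target || attn) over token positions (direction is an assumption)."""
    return sum(t * math.log(t / max(a, eps))
               for t, a in zip(target, attn) if t > 0)

def kl_logit_gradient(attn_logits, target):
    """Gradient of KL(target || softmax(logits)) w.r.t. the logits: attn - target."""
    attn = softmax(attn_logits)
    return [a - t for a, t in zip(attn, target)]

def regularized_loss(task_loss, attn_logits, target, beta):
    """Task loss plus beta times the structural KL penalty (hypothetical form)."""
    return task_loss + beta * kl_divergence(target, softmax(attn_logits))

# Hypothetical toy setup: 4 positions, positions 0 and 2 carry task information.
target = [0.5, 0.0, 0.5, 0.0]
uniform_logits = [0.0, 0.0, 0.0, 0.0]   # attention ignores the structure
aligned_logits = [5.0, 0.0, 5.0, 0.0]   # attention concentrates on positions 0 and 2

penalty_uniform = kl_divergence(target, softmax(uniform_logits))  # equals ln 2
penalty_aligned = kl_divergence(target, softmax(aligned_logits))  # close to 0
```

With `task_loss = 0.0` the regularized loss still has a nonzero gradient at `uniform_logits`, which is the sense in which such a penalty could bypass the plateau.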

Core claim

We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding an

What carries the argument

The implicit Bayesian posterior over the task dependency graph that attention must learn to place sufficient mass on every informative token.
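One way to read "sufficient mass on every informative token" is as a per-token lower bound rather than a bound on the total. The sketch below checks that reading for a single attention row. The threshold name `gamma` and the exact form of the test are assumptions made for illustration; the paper's precise structural condition is not reproduced on this page.

```python
def min_informative_mass(attn, informative):
    """Smallest attention weight assigned to any informative position."""
    return min(attn[i] for i in informative)

def satisfies_structural_condition(attn, informative, gamma):
    """True if every informative position receives at least gamma mass.

    Hypothetical check: 'gamma' and this per-token form are assumptions.
    """
    return min_informative_mass(attn, informative) >= gamma

informative = [0, 2]  # hypothetical informative positions

# Both informative tokens are attended to.
balanced = [0.45, 0.05, 0.45, 0.05]
# Total mass on informative tokens is high (0.91), but position 2 is nearly dropped.
lopsided = [0.90, 0.05, 0.01, 0.04]

ok_balanced = satisfies_structural_condition(balanced, informative, gamma=0.2)  # True
ok_lopsided = satisfies_structural_condition(lopsided, informative, gamma=0.2)  # False
```

The `lopsided` case is the interesting one: aggregate mass looks fine, yet one informative token is effectively discarded, which is the failure the structural condition is meant to rule out.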

If this is right

  • Generalization separates into an MLP capacity condition and an attention structure condition.
  • The grokking delay equals the explaining-away waiting time after memorization is eroded by weight decay.
  • A KL-based structural intervention produces an inverse-intervention-strength scaling law for grokking time.
  • Bayesian lottery tickets achieve generalization performance matching or exceeding standard lottery-ticket transfer on algorithmic tasks.
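The inverse dependence on weight decay can be illustrated with a back-of-the-envelope calculation. This is a simplified sketch under stated assumptions, not the paper's derivation: it assumes decoupled weight decay with learning rate \eta and coefficient \lambda, a negligible task gradient on the memorizing weights w once training loss is near zero, and a hypothetical norm threshold N^{\ast} below which memorization no longer holds.

```latex
% Simplifying assumptions (not from the paper): decoupled weight decay,
% negligible task gradient on w, hypothetical threshold N^{\ast}.
\begin{align*}
  w_{t+1} &= (1 - \eta\lambda)\, w_t
    \quad\Longrightarrow\quad
    \lVert w_t \rVert = (1 - \eta\lambda)^{t}\, \lVert w_0 \rVert, \\
  t^{\ast} &= \frac{\ln\!\bigl(\lVert w_0 \rVert / N^{\ast}\bigr)}{-\ln(1 - \eta\lambda)}
    \;\approx\; \frac{1}{\eta\lambda}\,\ln\frac{\lVert w_0 \rVert}{N^{\ast}}
    \qquad (\eta\lambda \ll 1).
\end{align*}
```

Under these assumptions the number of steps needed to erode the memorizing solution scales as 1/\lambda, which matches the qualitative shape of the delay described above.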

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention layers in larger models may exhibit similar structural delays on natural-language tasks if no explicit pressure is applied to keep the dependency graph visible.
  • Architectures that maintain gradient flow to attention throughout training could eliminate grokking without relying on weight decay.
  • The same separation of concerns might appear in other attention-based sequence models whenever informative tokens can be ignored without immediate loss penalty.

Load-bearing premise

Attention behaves like a Bayesian update over task dependencies whose gradient signal disappears once the MLP has driven training loss to zero through memorization.
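The starvation part of this premise rests on a standard property of softmax cross-entropy: the gradient with respect to the output logits is the predicted distribution minus the one-hot label, so its size is controlled by the loss itself. The sketch below checks that property numerically. It covers only the output-logit gradient, through which any gradient reaching attention must pass by the chain rule; it is not the paper's lemma, and the helper names are introduced here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss for one example with an integer class label."""
    return -math.log(softmax(logits)[label])

def logit_gradient(logits, label):
    """Gradient of the loss w.r.t. the logits: softmax(logits) - one_hot(label)."""
    grad = softmax(logits)
    grad[label] -= 1.0
    return grad

def l1_norm(vec):
    return sum(abs(v) for v in vec)

# As the margin on the correct class grows, the loss and the gradient shrink together.
# Identity: ||grad||_1 = 2 * (1 - p_label) = 2 * (1 - exp(-loss)) <= 2 * loss.
rows = []
for margin in (0.0, 2.0, 5.0, 10.0):
    logits = [margin, 0.0, 0.0]
    loss = cross_entropy(logits, 0)
    rows.append((margin, loss, l1_norm(logit_gradient(logits, 0))))
```

Each entry of `rows` pairs a loss value with a gradient norm that is at most twice that loss, so a network that has memorized its training set passes very little task signal backward.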

What would settle it

Running the proposed KL structural intervention on the algorithmic sequence tasks and finding that grokking time does not scale inversely with intervention strength would falsify the structural-inference account.
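A check of this kind could be run by fitting a slope on log-log axes: if grokking delay scales as 1/β, the slope of log(delay) against log(β) should be close to -1. The sketch below is an assumption about how one might set up that test, not the authors' analysis code, and the numbers in it are synthetic placeholders rather than measurements from the paper.

```python
import math

def loglog_slope(betas, delays):
    """Least-squares slope of log(delay) against log(beta)."""
    xs = [math.log(b) for b in betas]
    ys = [math.log(d) for d in delays]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic placeholder values, NOT results from the paper.
betas = [0.1, 0.3, 1.0, 3.0]

# Delays that follow an exact inverse law give a slope of -1.
inverse_law_delays = [500.0 / b for b in betas]
slope_inverse = loglog_slope(betas, inverse_law_delays)

# Delays that ignore beta give a slope of 0, which would count against the account.
flat_delays = [500.0 for _ in betas]
slope_flat = loglog_slope(betas, flat_delays)
```

In practice the measured delays would come from repeated training runs at each intervention strength, and the slope would need a confidence interval across seeds before it could be called consistent or inconsistent with -1.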

Figures

Figures reproduced from arXiv: 2605.15787 by Joseph An, Kai Hidajat, Solden Stoll.

Figure 1: Four Phases of Grokking. (Left) Test accuracy rises only after structural divergence D_KL (red, dashed) has largely collapsed. (Right) The attention task gradient norm falls by orders of magnitude after memorization, producing the Explaining-Away Plateau.
Figure 2: Isolating Structure from Capacity. Training trajectories under independent control of the Norm Condition (N) and Structural Condition (B_γ). Adversarial routing prevents generalization even under norm control, while oracle routing without norm control gives only partial generalization.
Figure 3: KL Acceleration. (Left) Injecting the structural prior β accelerates generalization. (Right) The grokking delay Δt_grok scales linearly with 1/β.
Figure 4: Transferring the Bayesian Ticket. (Left) Regularizing with the structural prior (“Ours”) matches or outpaces transferring a full Lottery Ticket. (Right) The Bayesian Ticket matches or beats the Lottery Ticket across random initializations.
Figure 5: Explaining-away bound. The empirical attention-logit gradient remains below the theoretical bound from Lemma 5.1 in both baseline and KL-regularized training. The bound is conservative, but the qualitative statement is sharp: once cross-entropy is small, the task gradient into attention is tiny.
Figure 6: Structural priors. (Left) Generic sparsity pressures help, but the KL prior reaches the generalizing solution fastest because it specifies where sparse mass should go. (Right) A learned teacher attention map transfers nearly the same acceleration as the oracle prior.
Figure 7: Timing the structural intervention. (Left) A short early KL intervention almost matches an always-on prior, suggesting that the routing ticket persists after the prior is removed. (Right) Late injection still accelerates grokking, but the transition moves later with the activation epoch.
Figure 8: Routing under distractors. (Left) With informative positions randomized per sequence, a sequence-dependent structural prior α*(s) still bypasses the plateau. (Right) Under added distractors, KL-trained models extrapolate better to longer contexts, although the longest lengths remain imperfect.
Figure 9: Geometry and distributed attention. (Left) Larger embedding dimension accelerates generalization, consistent with the geometric role of representation separation and subspace incoherence. (Right) In a 4-head model, aggregate oracle attention mass rapidly approaches one; KL keeps this distributed mass more tightly aligned.
Figure 10: Task diversity. Modular addition gives the cleanest delayed transition, while sparse parity and permutation composition show noisier or faster transitions. Across all three tasks, the attention-gradient diagnostic remains bounded, and KL accelerates the structural route when a plateau is present.
Figure 11: Decoupled and optimizer controls. (Left) Once attention is frozen to oracle routing, the downstream network generalizes far earlier than the fully coupled baseline. (Right) MLP norm decay persists across batch-size and Adam momentum variants, supporting the qualitative norm-contraction assumption.
Original abstract

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that grokking in Transformers results from delayed structural inference in attention. It formalizes attention as an implicit Bayesian posterior over the task dependency graph and proves that generalization requires two separable conditions: a Goldilocks bound on MLP capacity (aligning with norm-based accounts) and a Bayesian structural condition ensuring sufficient attention mass on every informative token. Early memorization drives cross-entropy near zero, starving attention of structural gradient; weight decay must then erode this memorization, yielding the observed inverse-weight-decay delay, which the authors derive as a structural waiting time. A KL-based structural intervention is shown to bypass the delay with an inverse-intervention-strength scaling law. Experiments on algorithmic sequence tasks isolate structure from capacity and indicate that the proposed Bayesian ticket matches or outperforms lottery-ticket transfer.

Significance. If the central derivations hold, the work provides a useful decoupling of capacity and structural conditions that could unify existing grokking explanations with a Bayesian view of attention. The explicit derivation of the waiting time as an explaining-away effect and the intervention scaling law are potentially valuable, as are the experiments that attempt to separate structural from capacity effects. The manuscript ships a falsifiable prediction (inverse-intervention-strength scaling) and reproducible experimental controls on algorithmic tasks.

major comments (3)
  1. [Formalization of attention] Formalization paragraph (beginning 'We formalize attention as an implicit Bayesian posterior'): the mapping from attention logits to an implicit posterior over the task dependency graph is asserted but not derived. Without the explicit posterior expression and the resulting gradient with respect to attention parameters, it remains unclear whether cross-entropy minimization produces strict gradient starvation once loss is small but nonzero, especially under multi-head or multi-layer interactions.
  2. [Derivation of structural waiting time] Derivation of structural waiting time (section presenting the inverse-weight-decay delay): the claim that the delay is a derived structural waiting time requires showing that the waiting-time expression is independent of the same fitted parameters used to define the model itself. The current presentation leaves open the possibility that the derived quantity reduces by construction to a reparameterization of the fitted weight-decay schedule.
  3. [Proofs of the two conditions] Proof of the two separable conditions (section asserting proofs of Goldilocks bound and Bayesian structural condition): the abstract states that both conditions are proved, yet the manuscript supplies no lemmas, equations, or explicit bounds. A load-bearing claim of separability therefore rests on an unshown argument; the experimental isolation of structure from capacity cannot substitute for the missing derivation.
minor comments (2)
  1. [Introduction / Related work] Notation for the Bayesian lottery ticket is introduced without a direct comparison table to the standard lottery-ticket hypothesis; a short side-by-side would clarify the claimed novelty.
  2. [Experiments] Figure captions for the algorithmic-task experiments should explicitly state the number of random seeds and whether error bars reflect standard deviation or standard error.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important areas where the formal arguments can be strengthened, and we will incorporate revisions to address each point explicitly while preserving the core contributions on decoupling capacity and structural conditions.

Point-by-point responses
  1. Referee: Formalization paragraph (beginning 'We formalize attention as an implicit Bayesian posterior'): the mapping from attention logits to an implicit posterior over the task dependency graph is asserted but not derived. Without the explicit posterior expression and the resulting gradient with respect to attention parameters, it remains unclear whether cross-entropy minimization produces strict gradient starvation once loss is small but nonzero, especially under multi-head or multi-layer interactions.

    Authors: We agree that the current presentation would benefit from greater explicitness. In the revised manuscript we will derive the mapping from attention logits to the implicit posterior over the task dependency graph, obtain the corresponding gradient with respect to attention parameters, and show that cross-entropy minimization produces gradient starvation for structural learning once the loss falls below a quantifiable threshold. The derivation will be extended to multi-head and multi-layer settings via appropriate product bounds on attention mass.
    revision: yes

  2. Referee: Derivation of structural waiting time (section presenting the inverse-weight-decay delay): the claim that the delay is a derived structural waiting time requires showing that the waiting-time expression is independent of the same fitted parameters used to define the model itself. The current presentation leaves open the possibility that the derived quantity reduces by construction to a reparameterization of the fitted weight-decay schedule.

    Authors: The structural waiting time is obtained from the explaining-away dynamics of the Bayesian posterior over the dependency graph and depends only on the structural parameters of that graph together with the weight-decay coefficient. We will add an explicit subsection that isolates these structural parameters from the learned MLP weights, thereby demonstrating that the waiting-time expression is not a reparameterization of the training schedule but a direct consequence of the attention mechanism's inference process.
    revision: yes

  3. Referee: Proof of the two separable conditions (section asserting proofs of Goldilocks bound and Bayesian structural condition): the abstract states that both conditions are proved, yet the manuscript supplies no lemmas, equations, or explicit bounds. A load-bearing claim of separability therefore rests on an unshown argument; the experimental isolation of structure from capacity cannot substitute for the missing derivation.

    Authors: We acknowledge that the initial submission presented the proofs at a high level. The Goldilocks bound on MLP capacity recovers known norm-based results, while the Bayesian structural condition follows from a lower bound on attention mass required for every informative token. In the revision we will supply the missing lemmas and explicit bounds that establish separability of the two conditions. The algorithmic-task experiments remain as empirical corroboration but will no longer be asked to stand in for the theoretical argument.
    revision: yes

Circularity Check

0 steps flagged

Derivation chain self-contained with independent Bayesian formalization

Full rationale

The paper formalizes attention as an implicit Bayesian posterior over the task dependency graph, proves two separable conditions (Goldilocks MLP capacity bound plus Bayesian structural mass requirement), and derives the inverse-weight-decay delay as an explaining-away structural waiting time from early cross-entropy minimization starving structural gradients. No quoted equations or steps reduce the derived waiting time to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation or ansatz smuggled from prior work. The central decoupling supplies independent content beyond norm-based or lottery-ticket accounts, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Central claim rests on treating attention as Bayesian posterior inference over a task graph and on the assumption that early memorization removes gradient for that inference.

axioms (1)
  • domain assumption: Attention implements an implicit Bayesian posterior over the task dependency graph
    Invoked to prove the structural condition and the explaining-away delay.
invented entities (1)
  • Bayesian lottery ticket (no independent evidence)
    purpose: Structural subnetwork that places mass on informative tokens
    Introduced to explain generalization after the memorization phase; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5768 in / 1173 out tokens · 36976 ms · 2026-05-20T21:07:44.457394+00:00 · methodology

