pith. machine review for the scientific record.

arxiv: 2301.05217 · v3 · submitted 2023-01-12 · 💻 cs.LG · cs.AI

Recognition: 3 theorem links

· Lean Theorem

Progress measures for grokking via mechanistic interpretability


Pith reviewed 2026-05-14 21:47 UTC · model grok-4.3

keywords grokking · mechanistic interpretability · modular addition · Fourier transform · progress measures · transformer circuits · emergent behavior · training dynamics

The pith

Transformers on modular addition learn a Fourier rotation algorithm that gradually replaces memorization during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reverse-engineers small transformers trained on modular addition and shows they implement addition by converting inputs to a circle via discrete Fourier transforms and using trigonometric identities to rotate one number by the other. This structured circuit forms gradually rather than appearing suddenly. The authors track the process with progress measures that split training into memorization, circuit formation, and cleanup phases. Grokking therefore reflects the slow amplification of the Fourier mechanism in the weights followed by removal of the memorizing components that were learned first.

Core claim

Small transformers trained on modular addition fully implement the task by representing numbers as points on a circle in Fourier space, rotating one input by the angle given by the other, and reading out the result; this circuit is visible in the weights and activations, confirmed by Fourier-space ablations, and its gradual growth plus later cleanup of memorization circuits produces the grokking phenomenon.

What carries the argument

The discrete Fourier transform circuit that encodes inputs on a circle and performs addition via rotation using trigonometric identities.
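
The rotation mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration of the arithmetic only, not the network's learned weights: the function name `fourier_mod_add`, the modulus `p = 113`, and the frequency set are illustrative assumptions that do not come from this page.

```python
import numpy as np

def fourier_mod_add(a: int, b: int, p: int, freqs) -> int:
    """Compute (a + b) mod p using only trigonometric features of a and b."""
    c = np.arange(p)          # every candidate output 0..p-1
    logits = np.zeros(p)
    for k in freqs:           # hypothetical set of "key" frequencies
        w = 2 * np.pi * k / p
        # Embed each input as a point on the circle at frequency k.
        cos_a, sin_a = np.cos(w * a), np.sin(w * a)
        cos_b, sin_b = np.cos(w * b), np.sin(w * b)
        # Angle-addition identities: rotate the point for a by the angle for b.
        cos_ab = cos_a * cos_b - sin_a * sin_b
        sin_ab = sin_a * cos_b + cos_a * sin_b
        # cos(w * (a + b - c)) equals 1 exactly when c == (a + b) mod p.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))

p = 113                        # assumed prime modulus (illustrative)
freqs = (14, 35, 41, 42, 52)   # assumed frequency set (illustrative)
print(fourier_mod_add(100, 50, p, freqs))  # 37
```

Summing over several frequencies sharpens the peak: every term reaches 1 at the correct answer, while the terms fall out of phase with one another at every other candidate.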

If this is right

  • Training dynamics divide into three phases: early memorization, gradual circuit formation, and late cleanup of memorizing components.
  • Progress measures based on Fourier components and weight norms track the continuous growth of the structured algorithm.
  • Grokking is not a discontinuous jump but the point at which the Fourier circuit overtakes memorization in accuracy.
  • The same reverse-engineering approach can identify similar circuits in other algorithmic tasks.
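
As a rough sketch of what a measure "based on Fourier components and weight norms" might look like, the snippet below computes two quantities for a weight matrix with one row per input token. Both functions (`fourier_concentration`, `weight_norm_sq`) are hypothetical illustrations and are not the paper's definitions; the sketch also assumes an odd modulus so that every non-constant frequency is treated alike.

```python
import numpy as np

def fourier_concentration(W: np.ndarray, top_k: int = 5) -> float:
    """Fraction of non-constant spectral power held by the top_k frequencies.

    W has shape (p, d): one row per input token 0..p-1 (p assumed odd).
    Values near 1 mean the matrix is sparse in the Fourier basis.
    """
    spectrum = np.fft.rfft(W, axis=0)             # DFT along the token axis
    power = (np.abs(spectrum) ** 2).sum(axis=1)   # total power per frequency
    power = power[1:]                             # drop the constant (k = 0) term
    top = np.sort(power)[-top_k:]                 # the top_k largest powers
    return float(top.sum() / power.sum())

def weight_norm_sq(W: np.ndarray) -> float:
    """Sum of squared weights, a simple proxy for total parameter norm."""
    return float((W ** 2).sum())
```

Logged at each checkpoint, a rising concentration alongside a falling weight norm would be one hedged way to see circuit formation followed by cleanup as continuous curves.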

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar progress measures could be defined for other emergent behaviors by first reverse-engineering the underlying circuit.
  • The cleanup phase suggests that regularization or longer training might systematically prune memorization in favor of structured solutions.
  • If the Fourier mechanism generalizes, modular arithmetic tasks could serve as a testbed for studying how networks discover group representations.

Load-bearing premise

The Fourier rotation circuit is the main mechanism responsible for the network's behavior once grokking occurs.

What would settle it

Ablating the identified Fourier components in the weights and activations leaves the network still able to compute modular addition correctly.
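
A Fourier-space ablation of this kind can be sketched as follows. This is a simplified one-dimensional version under stated assumptions: the helper `ablate_frequencies` is hypothetical, and it operates on any array whose first axis indexes the input token, whereas the paper's own ablation procedure is not specified on this page.

```python
import numpy as np

def ablate_frequencies(X: np.ndarray, freqs, keep_only: bool = False) -> np.ndarray:
    """Zero selected frequencies along axis 0 and transform back.

    X has shape (p, ...), indexed by input token along axis 0.
    freqs are integer frequencies in 0..p//2.
    keep_only=False removes the listed frequencies;
    keep_only=True removes everything except them.
    """
    p = X.shape[0]
    spectrum = np.fft.rfft(X, axis=0)
    selected = np.zeros(spectrum.shape[0], dtype=bool)
    selected[list(freqs)] = True
    to_zero = ~selected if keep_only else selected
    spectrum[to_zero] = 0
    # Pass n=p so an odd-length signal is reconstructed at its original length.
    return np.fft.irfft(spectrum, n=p, axis=0)
```

Running the model on activations with the key frequencies removed, and separately with only those frequencies kept, gives the two accuracies that would bear on the test described above.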

Original abstract

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that grokking in small transformers trained on modular addition arises from a fully reverse-engineered circuit that implements addition via discrete Fourier transforms and trigonometric identities (converting the operation to rotation on a circle). This is validated by direct analysis of weights and activations plus targeted ablations in Fourier space. The circuit understanding is then used to construct progress measures that decompose training into three continuous phases—memorization, circuit formation, and cleanup—showing that generalization emerges gradually from amplification of structured weight components rather than a discontinuous jump.

Significance. If the reverse-engineering and ablation results hold, the work supplies a concrete, mechanistically grounded example of progress measures for an emergent phenomenon. The explicit circuit identification (resting on external Fourier identities rather than post-hoc fitting) and the resulting phase decomposition offer a template for using interpretability to track training dynamics, with the low circularity of the measures being a notable strength.

major comments (2)
  1. §4 (Fourier ablations): The ablations in Fourier space produce a large drop in test accuracy, but the manuscript does not report the accuracy of the residual non-Fourier components or test whether any parallel non-ablated computations remain active and contribute to generalization. This is load-bearing for the central claim of complete reverse-engineering.
  2. §5 (Progress measures): The three-phase decomposition (memorization, circuit formation, cleanup) is defined from the identified Fourier components. If the ablation does not fully isolate the circuit, the measures may track only a subset of the mechanisms driving the observed test-accuracy curve, weakening the claim that grokking is fully explained by gradual circuit amplification.
minor comments (2)
  1. Figures 3 and 4: Axis labels and phase boundaries could be annotated more explicitly to make the correspondence between progress-measure curves and the three phases immediately visible.
  2. Abstract: The phrasing 'we fully reverse engineer' is strong given the residual-performance question above; a minor softening would better match the evidence presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review and positive recommendation for minor revision. We appreciate the focus on strengthening the evidence for complete reverse-engineering and its implications for the progress measures. We address each major comment below.

Point-by-point responses
  1. Referee: §4 (Fourier ablations): The ablations in Fourier space produce a large drop in test accuracy, but the manuscript does not report the accuracy of the residual non-Fourier components or test whether any parallel non-ablated computations remain active and contribute to generalization. This is load-bearing for the central claim of complete reverse-engineering.

    Authors: We agree that reporting the test accuracy of the residual non-Fourier components would provide stronger validation for the completeness of our reverse-engineering. In the revised manuscript, we will add this analysis, showing that the non-Fourier residual achieves high training accuracy (consistent with memorization) but near-chance test accuracy on unseen inputs. This demonstrates that no significant parallel non-Fourier computations contribute to generalization, supporting that the Fourier circuit accounts for the learned algorithm. revision: yes

  2. Referee: §5 (Progress measures): The three-phase decomposition (memorization, circuit formation, cleanup) is defined from the identified Fourier components. If the ablation does not fully isolate the circuit, the measures may track only a subset of the mechanisms driving the observed test-accuracy curve, weakening the claim that grokking is fully explained by gradual circuit amplification.

    Authors: We maintain that the ablation results (near-total drop in test accuracy upon Fourier component removal) provide strong evidence that the identified circuit is the primary driver, allowing the progress measures to track the full generalization dynamics. However, to further address potential concerns about isolation, we will partially revise §5 by adding explicit correlations between the progress measures and test accuracy, along with controls showing that the phase transitions align specifically with changes in the Fourier components rather than other weight statistics. revision: partial

Circularity Check

0 steps flagged

No significant circularity; circuit identified via independent analysis before defining progress measures

Full rationale

The paper first reverse-engineers the network's algorithm through direct inspection of weights, activations, and Fourier-space ablations, grounding the DFT-plus-trigonometric circuit in external mathematical identities rather than in the grokking dynamics or any fitted progress measures. Only after this identification do the authors define the three phases (memorization, circuit formation, cleanup) as derived quantities to track training. No step reduces by construction to its own inputs, no parameters are fitted to the target curve and relabeled as predictions, and no load-bearing claim rests on self-citation chains. The derivation remains self-contained and externally falsifiable via ablation experiments.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard properties of the discrete Fourier transform on the integers modulo p (a cyclic group) and the assumption that the network's learned weights implement exactly the identified rotation mechanism once the circuit forms.

axioms (2)
  • standard math: The discrete Fourier transform converts modular addition into component-wise multiplication (rotation) in frequency space.
    Invoked in the reverse-engineering section to map the addition task onto trigonometric identities.
  • domain assumption: Ablations performed by zeroing specific frequency components isolate the causal mechanism without introducing new artifacts.
    Used to confirm the circuit; this is a standard mechanistic-interpretability assumption rather than a paper-specific invention.
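
The first axiom can be written out explicitly. The notation below (modulus p, frequency k, angle ω_k, inputs a and b, candidate output c) is introduced here for illustration and is not taken from this page; the identities themselves are standard.

```latex
\[
\omega_k = \frac{2\pi k}{p}, \qquad
e^{i\omega_k (a+b)} = e^{i\omega_k a}\, e^{i\omega_k b}
\]
\[
\cos\bigl(\omega_k (a+b)\bigr) = \cos(\omega_k a)\cos(\omega_k b) - \sin(\omega_k a)\sin(\omega_k b)
\]
\[
\sin\bigl(\omega_k (a+b)\bigr) = \sin(\omega_k a)\cos(\omega_k b) + \cos(\omega_k a)\sin(\omega_k b)
\]
\[
\cos\bigl(\omega_k (a+b-c)\bigr) = \cos\bigl(\omega_k (a+b)\bigr)\cos(\omega_k c) + \sin\bigl(\omega_k (a+b)\bigr)\sin(\omega_k c)
\]
```

Because the angle ω_k(a+b) only matters modulo 2π, the sum a+b is automatically reduced modulo p, and the last expression attains its maximum value of 1 whenever c ≡ a+b (mod p).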



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

    cs.LG 2026-05 unverdicted novelty 8.0

    In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to gener...

  2. When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

    cs.LG 2026-05 unverdicted novelty 7.0

    Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

  3. Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers

    cs.AI 2026-05 conditional novelty 7.0

    The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.

  4. Interpreting Reinforcement Learning Agents with Susceptibilities

    cs.LG 2026-05 unverdicted novelty 7.0

    Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

  5. The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

    cs.LG 2026-05 unverdicted novelty 7.0

    Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.

  6. ILDR: Geometric Early Detection of Grokking

    cs.LG 2026-04 unverdicted novelty 7.0

    ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.

  7. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  8. Dimensional Criticality at Grokking Across MLPs and Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Effective cascade dimension D(t) crosses D=1 at the grokking transition in MLPs and Transformers, with opposite directions for modular addition versus XOR, consistent with attraction to a shared critical manifold.

  9. The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior

    cs.LG 2026-03 unverdicted novelty 7.0

    The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...

  10. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  11. Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

    cs.LG 2026-05 unverdicted novelty 6.0

    Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.

  12. Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory

    cs.LG 2026-05 unverdicted novelty 6.0

    A Random Matrix Theory method identifies growing Correlation Traps in neural network weight spectra during an 'anti-grokking' overfitting phase, and applies the same diagnostic to some foundation LLMs.

  13. Not How Many, But Which: Parameter Placement in Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.

  14. Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization

    stat.ML 2026-05 unverdicted novelty 6.0

    Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...

  15. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...

  16. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...

  17. LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces

    cs.CL 2026-04 unverdicted novelty 6.0

    LAG-XAI treats paraphrasing as affine flows in semantic manifolds using Lie-inspired approximations, achieving AUC 0.7713 on paraphrase detection and 95.3% hallucination detection on HaluEval.

  18. Grokking as Dimensional Phase Transition in Neural Networks

    cs.LG 2026-04 unverdicted novelty 6.0

    Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.

  19. PhiNet: Speaker Verification with Phonetic Interpretability

    eess.AS 2026-04 unverdicted novelty 6.0

    PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.

  20. Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

    cs.LG 2026-05 unverdicted novelty 5.0

    Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.

  21. Emergent Semantic Role Understanding in Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned m...

  22. Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance

    cs.AI 2026-05 unverdicted novelty 4.0

    AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.

  23. Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking

    cs.LG 2026-04 unverdicted novelty 4.0

    Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.

  24. There Will Be a Scientific Theory of Deep Learning

    stat.ML 2026-04 unverdicted novelty 2.0

    A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...
