Recognition: 3 theorem links · Lean Theorem
Progress measures for grokking via mechanistic interpretability
Pith reviewed 2026-05-14 21:47 UTC · model grok-4.3
The pith
Transformers on modular addition learn a Fourier rotation algorithm that gradually replaces memorization during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Small transformers trained on modular addition fully implement the task by representing numbers as points on a circle in Fourier space, rotating one input by the angle given by the other, and reading out the result; this circuit is visible in the weights and activations, confirmed by Fourier-space ablations, and its gradual growth plus later cleanup of memorization circuits produces the grokking phenomenon.
What carries the argument
The discrete Fourier transform circuit that encodes inputs on a circle and performs addition via rotation using trigonometric identities.
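The rotation idea can be made concrete with a few lines of NumPy. The following is a minimal sketch of the algorithm as described above, not the trained network itself: the function name `fourier_mod_add`, the modulus, and the choice of frequencies are illustrative assumptions.

```python
from typing import Sequence

import numpy as np


def fourier_mod_add(a: int, b: int, p: int, freqs: Sequence[int]) -> int:
    """Compute (a + b) mod p by rotating points on circles (illustrative sketch)."""
    c = np.arange(p)          # every candidate answer
    logits = np.zeros(p)
    for k in freqs:           # hypothetical "key frequencies", each in 1..p-1
        w = 2 * np.pi * k / p
        # Step 1: represent each input as a point on the unit circle.
        cos_a, sin_a = np.cos(w * a), np.sin(w * a)
        cos_b, sin_b = np.cos(w * b), np.sin(w * b)
        # Step 2: rotate a by the angle of b using the angle-addition identities.
        cos_ab = cos_a * cos_b - sin_a * sin_b   # cos(w(a+b))
        sin_ab = sin_a * cos_b + cos_a * sin_b   # sin(w(a+b))
        # Step 3: read out with cos(w(a+b-c)), which peaks where c = a+b mod p.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return int(np.argmax(logits))
```

For a prime modulus, each term `cos(w(a+b-c))` equals 1 only at the correct residue, so summing over several frequencies widens the margin between the right answer and every other candidate.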
If this is right
- Training dynamics divide into three phases: early memorization, gradual circuit formation, and late cleanup of memorizing components.
- Progress measures based on Fourier components and weight norms track the continuous growth of the structured algorithm.
- Grokking is not a discontinuous jump but the point at which the Fourier circuit overtakes memorization in accuracy.
- The same reverse-engineering approach can identify similar circuits in other algorithmic tasks.
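To illustrate what a Fourier-based progress measure might look like, the sketch below computes how much of a weight matrix's energy sits in its few strongest frequencies along the token axis. This is a hypothetical stand-in chosen for simplicity, not the measure defined in the paper; the name `fourier_concentration` and the `top_k` parameter are assumptions, and the modulus is assumed to be odd.

```python
import numpy as np


def fourier_concentration(W: np.ndarray, top_k: int) -> float:
    """Fraction of W's non-constant energy held by its top_k frequencies.

    W has shape (p, d): one row per input token 0..p-1 (p assumed odd).
    Values near 1 suggest a sparse, structured Fourier representation;
    values near top_k / ((p - 1) / 2) suggest no frequency structure.
    """
    spec = np.fft.rfft(W, axis=0)                # DFT along the token axis
    power = (np.abs(spec) ** 2).sum(axis=1)[1:]  # energy per frequency, DC dropped
    top = np.sort(power)[::-1][:top_k].sum()     # energy in the strongest bins
    return float(top / power.sum())
```

Evaluated on checkpoints saved during training, a quantity of this kind would be expected to rise smoothly as the structured circuit forms, well before test accuracy moves.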
Where Pith is reading between the lines
- Similar progress measures could be defined for other emergent behaviors by first reverse-engineering the underlying circuit.
- The cleanup phase suggests that regularization or longer training might systematically prune memorization in favor of structured solutions.
- If the Fourier mechanism generalizes, modular arithmetic tasks could serve as a testbed for studying how networks discover group representations.
Load-bearing premise
The Fourier rotation circuit is the main mechanism responsible for the network's behavior once grokking occurs.
What would settle it
Ablating the identified Fourier components in the weights and activations leaves the network still able to compute modular addition correctly.
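The shape of such an ablation test can be sketched on synthetic data. The code below builds toy logits from a handful of cosine terms plus Gaussian noise, then filters the logits in Fourier space along the output axis. Everything here is an assumption made for illustration (the toy logits, the helper names, the frequencies, and the noise level); a real test would apply the same filter to the trained network's logits or activations.

```python
import numpy as np


def toy_logits(p, freqs, noise=0.0, seed=0):
    """Synthetic logits[a, b, c] = sum_k cos(2*pi*k*(a+b-c)/p) + Gaussian noise."""
    rng = np.random.default_rng(seed)
    a = np.arange(p)[:, None, None]
    b = np.arange(p)[None, :, None]
    c = np.arange(p)[None, None, :]
    signal = sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in freqs)
    return signal + noise * rng.standard_normal((p, p, p))


def filter_frequencies(logits, freqs, keep):
    """Keep only (keep=True) or remove (keep=False) the given frequencies over c."""
    p = logits.shape[-1]
    mask = np.zeros(p, dtype=bool)
    for k in freqs:
        mask[k % p] = True       # positive-frequency bin
        mask[(-k) % p] = True    # its conjugate bin
    selected = mask if keep else ~mask
    spec = np.fft.fft(logits, axis=-1)
    return np.fft.ifft(np.where(selected, spec, 0), axis=-1).real


def accuracy(logits):
    """Fraction of (a, b) pairs whose argmax over c equals (a + b) mod p."""
    p = logits.shape[0]
    target = (np.arange(p)[:, None] + np.arange(p)[None, :]) % p
    return float((logits.argmax(axis=-1) == target).mean())
```

On this toy, removing the chosen frequencies leaves only noise, while keeping only those frequencies discards most of the noise and preserves the signal. If a trained network kept computing the sum after the first kind of ablation, the load-bearing premise would fail.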
Original abstract
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures that underlie the seemingly discontinuous qualitative changes. We argue that progress measures can be found via mechanistic interpretability: reverse-engineering learned behaviors into their individual components. As a case study, we investigate the recently-discovered phenomenon of "grokking" exhibited by small transformers trained on modular addition tasks. We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle. We confirm the algorithm by analyzing the activations and weights and by performing ablations in Fourier space. Based on this understanding, we define progress measures that allow us to study the dynamics of training and split training into three continuous phases: memorization, circuit formation, and cleanup. Our results show that grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that grokking in small transformers trained on modular addition arises from a fully reverse-engineered circuit that implements addition via discrete Fourier transforms and trigonometric identities (converting the operation to rotation on a circle). This is validated by direct analysis of weights and activations plus targeted ablations in Fourier space. The circuit understanding is then used to construct progress measures that decompose training into three continuous phases—memorization, circuit formation, and cleanup—showing that generalization emerges gradually from amplification of structured weight components rather than a discontinuous jump.
Significance. If the reverse-engineering and ablation results hold, the work supplies a concrete, mechanistically grounded example of progress measures for an emergent phenomenon. The explicit circuit identification (resting on external Fourier identities rather than post-hoc fitting) and the resulting phase decomposition offer a template for using interpretability to track training dynamics, with the low circularity of the measures being a notable strength.
major comments (2)
- [§4 (Fourier ablations)] The ablations in Fourier space produce a large drop in test accuracy, but the manuscript does not report the accuracy of the residual non-Fourier components or test whether any parallel non-ablated computations remain active and contribute to generalization. This is load-bearing for the central claim of complete reverse-engineering.
- [§5 (Progress measures)] The three-phase decomposition (memorization, circuit formation, cleanup) is defined from the identified Fourier components. If the ablation does not fully isolate the circuit, the measures may track only a subset of the mechanisms driving the observed test-accuracy curve, weakening the claim that grokking is fully explained by gradual circuit amplification.
minor comments (2)
- [Figures 3 and 4] Axis labels and phase boundaries could be annotated more explicitly to make the correspondence between progress-measure curves and the three phases immediately visible.
- [Abstract] The abstract's phrasing 'we fully reverse engineer' is strong given the residual-performance question above; a minor softening would better match the evidence presented.
Simulated Author's Rebuttal
Thank you for the detailed review and positive recommendation for minor revision. We appreciate the focus on strengthening the evidence for complete reverse-engineering and its implications for the progress measures. We address each major comment below.
Point-by-point responses
-
Referee: [§4 (Fourier ablations)] The ablations in Fourier space produce a large drop in test accuracy, but the manuscript does not report the accuracy of the residual non-Fourier components or test whether any parallel non-ablated computations remain active and contribute to generalization. This is load-bearing for the central claim of complete reverse-engineering.
Authors: We agree that reporting the test accuracy of the residual non-Fourier components would provide stronger validation for the completeness of our reverse-engineering. In the revised manuscript, we will add this analysis, showing that the non-Fourier residual achieves high training accuracy (consistent with memorization) but near-chance test accuracy on unseen inputs. This demonstrates that no significant parallel non-Fourier computations contribute to generalization, supporting the claim that the Fourier circuit accounts for the learned algorithm.
Revision: yes
-
Referee: [§5 (Progress measures)] The three-phase decomposition (memorization, circuit formation, cleanup) is defined from the identified Fourier components. If the ablation does not fully isolate the circuit, the measures may track only a subset of the mechanisms driving the observed test-accuracy curve, weakening the claim that grokking is fully explained by gradual circuit amplification.
Authors: We maintain that the ablation results (near-total drop in test accuracy upon Fourier component removal) provide strong evidence that the identified circuit is the primary driver, allowing the progress measures to track the full generalization dynamics. However, to further address potential concerns about isolation, we will partially revise §5 by adding explicit correlations between the progress measures and test accuracy, along with controls showing that the phase transitions align specifically with changes in the Fourier components rather than other weight statistics.
Revision: partial
Circularity Check
No significant circularity; circuit identified via independent analysis before defining progress measures
full rationale
The paper first reverse-engineers the network's algorithm through direct inspection of weights, activations, and Fourier-space ablations, grounding the DFT-plus-trigonometric circuit in external mathematical identities rather than in the grokking dynamics or any fitted progress measures. Only after this identification do the authors define the three phases (memorization, circuit formation, cleanup) as derived quantities to track training. No step reduces by construction to its own inputs, no parameters are fitted to the target curve and relabeled as predictions, and no load-bearing claim rests on self-citation chains. The derivation remains self-contained and externally falsifiable via ablation experiments.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: The discrete Fourier transform converts modular addition into component-wise multiplication (rotation) in frequency space.
- domain assumption: Ablations performed by zeroing specific frequency components isolate the causal mechanism without introducing new artifacts.
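The first axiom can be verified in one line. In the sketch below, e_a denotes the one-hot vector of the residue a modulo p and the hat denotes its DFT; this notation is introduced here for illustration and is not taken from the paper.

```latex
\hat{e}_a[k] = \sum_{n=0}^{p-1} e_a[n]\, e^{-2\pi i k n / p} = e^{-2\pi i k a / p},
\qquad
\hat{e}_a[k]\,\hat{e}_b[k] = e^{-2\pi i k (a+b) / p} = \hat{e}_{(a+b) \bmod p}[k].
```

The last equality holds because e^{-2\pi i k m} = 1 for every integer m, so reducing a+b modulo p leaves each phase unchanged. Taking real and imaginary parts with w = 2\pi k / p gives the angle-addition identities for \cos(w(a+b)) and \sin(w(a+b)).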
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat embedding and orbit periodicity · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We fully reverse engineer the algorithm learned by these networks, which uses discrete Fourier transforms and trigonometric identities to convert addition to rotation about a circle.
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
ablating key frequencies used by the model reduces performance to chance, while ablating the other 95% of frequencies slightly improves performance
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
training can be split into three continuous phases: memorization, circuit formation, and cleanup
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models
In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to gener...
-
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.
-
Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
The Divergent Remote Association Test (DRAT) is the first creativity test that significantly predicts LLMs' scientific ideation ability, unlike prior tests such as DAT or RAT.
-
Interpreting Reinforcement Learning Agents with Susceptibilities
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
-
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.
-
ILDR: Geometric Early Detection of Grokking
ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.
-
Grokking of Diffusion Models: Case Study on Modular Addition
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
-
Dimensional Criticality at Grokking Across MLPs and Transformers
Effective cascade dimension D(t) crosses D=1 at the grokking transition in MLPs and Transformers, with opposite directions for modular addition versus XOR, consistent with attraction to a shared critical manifold.
-
The Long Delay to Arithmetic Generalization: When Learned Representations Outrun Behavior
The grokking delay in encoder-decoder models on one-step Collatz prediction stems from decoder inability to use early-learned encoder representations of parity and residue structure, with numeral base acting as a stro...
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.
-
Detecting overfitting in Neural Networks during long-horizon grokking using Random Matrix Theory
A Random Matrix Theory method identifies growing Correlation Traps in neural network weight spectra during an 'anti-grokking' overfitting phase, and applies the same diagnostic to some foundation LLMs.
-
Not How Many, But Which: Parameter Placement in Low-Rank Adaptation
Gradient-informed placement of LoRA parameters recovers full performance under GRPO while random placement does not, due to differences in gradient rank and stability across training regimes.
-
Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization
Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-...
-
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...
-
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...
-
LAG-XAI: A Lie-Inspired Affine Geometric Framework for Interpretable Paraphrasing in Transformer Latent Spaces
LAG-XAI treats paraphrasing as affine flows in semantic manifolds using Lie-inspired approximations, achieving AUC 0.7713 on paraphrase detection and 95.3% hallucination detection on HaluEval.
-
Grokking as Dimensional Phase Transition in Neural Networks
Grokking occurs as the effective dimensionality of the gradient field transitions from sub-diffusive to super-diffusive at the onset of generalization, exhibiting self-organized criticality.
-
PhiNet: Speaker Verification with Phonetic Interpretability
PhiNet adds phonetic interpretability to speaker verification while matching the accuracy of standard black-box models on VoxCeleb, SITW, and LibriSpeech.
-
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
-
Emergent Semantic Role Understanding in Language Models
Semantic role understanding partially emerges during language model pre-training, with linear probes on frozen representations achieving substantial performance that improves with scale but does not match fine-tuned m...
-
Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance
AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.
-
Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking
Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.
-
There Will Be a Scientific Theory of Deep Learning
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...