Massive Activations in Large Language Models
Recognition: 2 theorem links
Pith reviewed 2026-05-16 06:59 UTC · model grok-4.3
The pith
Large language models contain a small number of massive activations that remain constant across inputs and act as indispensable bias terms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output.
What carries the argument
Massive activations: the small set of high-magnitude, nearly input-invariant activation values that serve as fixed bias terms and drive attention concentration.
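The definition above can be made concrete with a small detection sketch. The absolute floor and median-ratio thresholds here are illustrative assumptions, not the paper's exact criterion:

```python
import numpy as np

def find_massive_activations(hidden, abs_floor=100.0, ratio=1000.0):
    """Flag entries whose magnitude dwarfs the rest of the tensor.

    Hedged thresholds: an entry counts as "massive" if it exceeds both
    an absolute floor and a large multiple of the median magnitude.
    """
    mags = np.abs(hidden)
    median = np.median(mags)
    mask = (mags > abs_floor) & (mags > ratio * median)
    return np.argwhere(mask)

# Toy hidden state: ordinary values ~N(0, 1) plus two planted spikes.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))          # (tokens, features)
h[0, 3] = 2500.0                     # hypothetical massive activation
h[2, 3] = -1800.0                    # another, same feature dimension
print(find_massive_activations(h))   # -> the planted positions (0, 3) and (2, 3)
```

Note the spikes share a feature dimension: the paper reports that massive activations cluster in very few feature dimensions, which is what this toy mimics.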
If this is right
- Attention probability mass concentrates on the tokens that produce the massive activations.
- Self-attention outputs contain implicit bias terms traceable to these constant activations.
- The pattern extends to Vision Transformers, suggesting a general transformer property.
- Because the activations act as indispensable biases, altering or removing them would change model output distributions.
- Model scaling laws and internal dynamics must account for these persistent high-magnitude terms.
Where Pith is reading between the lines
- Interpreting LLMs may become simpler by isolating these few constant terms rather than analyzing every activation.
- Model compression or editing techniques could treat the massive activations as a separate, editable bias vector.
- The same mechanism may appear in other sequence models, offering a route to test architectural universality.
- Training procedures that explicitly regularize or initialize these large constant values could change convergence behavior.
Load-bearing premise
The observed constancy of the largest activation values and their bias-like behavior holds for every LLM architecture and every input distribution.
What would settle it
Measuring the largest activations on two very different inputs inside the same layer of a new LLM and finding that their relative magnitudes or absolute values change by more than a small constant factor.
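That settling experiment can be sketched directly. The comparison factor is a hypothetical stand-in for "small constant factor":

```python
import numpy as np

def top_magnitude(hidden):
    """Largest activation magnitude in a hidden-state tensor."""
    return float(np.abs(hidden).max())

def constancy_violated(h_a, h_b, factor=2.0):
    """True if the largest magnitudes differ by more than `factor`
    between two inputs -- evidence against input-invariance."""
    a, b = top_magnitude(h_a), top_magnitude(h_b)
    return max(a, b) / min(a, b) > factor

rng = np.random.default_rng(1)
h1 = rng.normal(size=(4, 8)); h1[0, 3] = 2000.0   # input A, massive entry
h2 = rng.normal(size=(4, 8)); h2[0, 3] = 2100.0   # input B, nearly constant spike
h3 = rng.normal(size=(4, 8))                      # input with no massive entry

print(constancy_violated(h1, h2))  # -> False: consistent with the paper's claim
print(constancy_violated(h1, h3))  # -> True: this outcome would undercut it
```

In a real test, `h1` and `h2` would be hidden states captured from the same layer of an actual LLM on two unrelated prompts rather than synthetic tensors.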
Original abstract
We observe an empirical phenomenon in Large Language Models (LLMs) -- very few activations exhibit significantly larger values than others (e.g., 100,000 times larger). We call them massive activations. First, we demonstrate the widespread existence of massive activations across various LLMs and characterize their locations. Second, we find their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs. Third, these massive activations lead to the concentration of attention probabilities to their corresponding tokens, and further, implicit bias terms in the self-attention output. Last, we also study massive activations in Vision Transformers. Code is available at https://github.com/locuslab/massive-activations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical observation of 'massive activations' in large language models: a small number of activations with values orders of magnitude larger than the rest (e.g., 100,000x). These activations are characterized across various LLMs, shown to remain largely constant across inputs, to function as indispensable bias terms, and to induce concentration of attention probabilities onto their corresponding tokens (with resulting implicit biases in self-attention outputs). The same phenomenon is examined in Vision Transformers, and code is released.
Significance. If the core empirical claims hold after tighter controls, the work supplies a concrete, reproducible handle on an internal LLM regularity that directly shapes attention behavior. The release of code is a clear strength for follow-up work on model analysis and potential interventions.
major comments (3)
- [Abstract] Abstract and characterization sections: the claim that massive activations 'function as indispensable bias terms' and 'lead to the concentration of attention probabilities' rests on observational correlations but provides no ablation (e.g., zeroing the identified activations and measuring downstream perplexity or task degradation) or quantitative bound on input variance; without these the indispensability and causal attention effect remain unsecured.
- [Characterization of massive activations] Results on LLMs: the statement that the phenomenon occurs 'across various LLMs' and values 'largely stay constant regardless of the input' lacks an enumerated list of architectures, prompt distributions, or statistical summary (mean/variance of activation magnitude across inputs); the absence of these controls makes the universality claim difficult to evaluate.
- [Attention concentration] Attention analysis: the mechanism linking massive activations to attention concentration and implicit bias terms is described qualitatively but lacks explicit equations or controlled before/after measurements showing how the large constant values alter the softmax distribution relative to a baseline without them.
minor comments (2)
- [Introduction] Notation for activation magnitude thresholds and 'massive' criteria should be defined explicitly (e.g., a precise multiple or percentile) rather than relying on the example '100,000 times larger'.
- [Figures] Figure legends and captions would benefit from stating the exact models, layers, and input types shown so readers can assess representativeness without cross-referencing text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional experiments, documentation, and quantitative analyses as requested.
Point-by-point responses
- Referee: [Abstract] Abstract and characterization sections: the claim that massive activations 'function as indispensable bias terms' and 'lead to the concentration of attention probabilities' rests on observational correlations but provides no ablation (e.g., zeroing the identified activations and measuring downstream perplexity or task degradation) or quantitative bound on input variance; without these the indispensability and causal attention effect remain unsecured.
Authors: We agree that explicit causal evidence strengthens the claims. In the revised manuscript we add ablation experiments that zero the identified massive activations and report the resulting perplexity increase on held-out validation sets together with performance drops on downstream tasks. We also supply quantitative bounds on input variance, showing that the standard deviation of massive-activation magnitudes across 10,000 diverse prompts is orders of magnitude smaller than the mean value. revision: yes
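One reason zero-ablation is so destructive can be seen in a toy LayerNorm calculation. This is a minimal numpy sketch, not the paper's protocol; the spike value and dimensions are invented:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Plain LayerNorm over the last axis, no learned scale or shift."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(0)
h = rng.normal(size=64)        # ordinary hidden-state values
h[7] = 3000.0                  # one hypothetical massive activation

ablated = h.copy()
ablated[7] = 0.0               # zero-ablate just that entry

# The spike dominates the mean and variance, so removing it rescales
# every other coordinate after normalization: the whole vector shifts,
# not just the ablated position.
drift = float(np.abs(layer_norm(h) - layer_norm(ablated)).mean())
print(drift > 0.3)             # -> True: a large average per-coordinate shift
```

This is only an intuition pump for why a single near-constant entry can behave like a load-bearing bias; the revised manuscript's perplexity measurements are the actual evidence.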
- Referee: [Characterization of massive activations] Results on LLMs: the statement that the phenomenon occurs 'across various LLMs' and values 'largely stay constant regardless of the input' lacks an enumerated list of architectures, prompt distributions, or statistical summary (mean/variance of activation magnitude across inputs); the absence of these controls makes the universality claim difficult to evaluate.
Authors: We accept that greater specificity is needed. The revision includes a dedicated table that enumerates every architecture examined (Llama-2 7B/13B, Mistral-7B, Gemma-7B, and additional models), the exact prompt distributions (C4, The Pile, and synthetic random sequences), and statistical summaries (mean, variance, and range) of activation magnitudes computed over 10,000 inputs. revision: yes
- Referee: [Attention concentration] Attention analysis: the mechanism linking massive activations to attention concentration and implicit bias terms is described qualitatively but lacks explicit equations or controlled before/after measurements showing how the large constant values alter the softmax distribution relative to a baseline without them.
Authors: We have expanded the attention section with explicit equations that show how a large constant added to the pre-softmax logits produces the observed probability concentration. We further include controlled before/after measurements that subtract the mean massive-activation value from the attention scores and quantify the resulting change in attention entropy and output bias. revision: yes
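The equations referred to here reduce to a basic softmax property: adding a large constant to one pre-softmax logit drives that token's attention probability toward 1 and collapses entropy. A minimal numerical check, with arbitrary logit values and spike size:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())    # subtract max for numerical stability
    return e / e.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
logits = rng.normal(size=16)   # ordinary attention logits for 16 tokens
spiked = logits.copy()
spiked[0] += 25.0              # implicit bias from a massive activation

p_base, p_spiked = softmax(logits), softmax(spiked)
print(p_spiked[0] > 0.99)                   # -> True: mass piles onto token 0
print(entropy(p_spiked) < entropy(p_base))  # -> True: attention entropy collapses
```

Because the spike is (per the paper) nearly input-independent, this concentration acts like a fixed additive bias on the self-attention output rather than input-driven routing.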
Circularity Check
No circularity: empirical observations grounded in direct measurements
Full rationale
The paper reports direct empirical measurements of activation magnitudes across LLMs, their input-independence, and downstream effects on attention. These are presented as observed phenomena without any derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on data characterization rather than reducing to inputs by construction, making the analysis self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard transformer architecture and activation definitions hold as in prior literature.
Lean theorems connected to this paper
- Cost.Jcost · Core.Jcost_unit0 (tag: echoes)
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Linked passage: "their values largely stay constant regardless of the input, and they function as indispensable bias terms in LLMs"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Attention Sinks in Diffusion Transformers: A Causal Analysis. Suppressing attention sinks in diffusion transformers does not degrade text-image alignment or most preference metrics, revealing a dissociation between generation trajectory changes and semantic output quality.
- A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models. Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
- A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models. Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.
- Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs. Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning,...
- When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models. Attention sinks in LVLMs create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.
- Scaling and evaluating sparse autoencoders. K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination. LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- Attention Sinks in Diffusion Transformers: A Causal Analysis. Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.
- Taming Outlier Tokens in Diffusion Transformers. Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
- Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing. TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.
- Graph-Guided Adaptive Channel Elimination for KV Cache Compression. GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free. Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. PyramidKV dynamically compresses the KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.
- HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory. HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
- Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay. The Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
- OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension. OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering u...
- Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation. Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.
- SnapMLA: Efficient Long-Context MLA Decoding via Hardware-Aware FP8 Quantized Pipelining. SnapMLA achieves up to 1.91x higher throughput in long-output MLA decoding using FP8 quantization and specialized kernels while keeping benchmark quality near the BF16 baseline.
- MiMo-V2-Flash Technical Report. MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
- DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization. DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3...