arXiv preprint arXiv:2210.10749
10 Pith papers cite this work.
citing papers explorer
- Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers
  Graph tokenizations for Transformers induce distinct depth regimes with proven separations and impossibility results for converting between them at limited depth.
- Learning State-Tracking from Code Using Linear RNNs
  Linear RNNs track states from REPL code traces of permutations better than Transformers, but non-linear RNNs outperform them in partially observable probabilistic automata.
- Scaling Latent Reasoning via Looped Language Models
  Looped language models with latent iterative computation and entropy-regularized depth allocation match the performance of standard LLMs of up to 12B parameters through superior knowledge manipulation.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
  A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- The two clocks and the innovation window: When and how generative models learn rules
  Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
- Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning
  Neural operators progressively forget domain geometry with depth due to Markovian layers and global mixing; a geometry memory injection mechanism mitigates this forgetting.
- The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
  Recurrent Transformers add per-layer recurrent memory via self-attention on their own activations, plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard Transformers while using fewer layers.
- The Serial Scaling Hypothesis
  The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.
- A Sharper Picture of Generalization in Transformers