hub

arXiv preprint arXiv:2210.10749 , year=

Transformers Learn Shortcuts to Automata , author= · 2022 · arXiv 2210.10749

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

read on arXiv browse 10 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

cs.LG · 2026-05-21 · accept · novelty 8.0

Graph tokenizations for Transformers induce distinct depth regimes with proven separations and impossibility results for converting between them at limited depth.

Learning State-Tracking from Code Using Linear RNNs

cs.LG · 2026-02-16 · unverdicted · novelty 7.0

Linear RNNs track states from REPL code traces of permutations better than Transformers, but non-linear RNNs outperform them in partially observable probabilistic automata.

Scaling Latent Reasoning via Looped Language Models

cs.CL · 2025-10-29 · unverdicted · novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

The two clocks and the innovation window: When and how generative models learn rules

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Neural operators progressively forget domain geometry with depth due to Markovian layers and global mixing; a geometry memory injection mechanism mitigates this forgetting.

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

A Sharper Picture of Generalization in Transformers

cs.LG · 2026-05-20

citing papers explorer

Showing 1 of 1 citing paper after filters.

A Sharper Picture of Generalization in Transformers cs.LG · 2026-05-20 · unreviewed · ref 19

arXiv preprint arXiv:2210.10749 , year=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer