hub

arXiv preprint arXiv:2210.10749 , year=

Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang · 2022 · arXiv 2210.10749

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Transformers Provably Learn to Internalize Chain-of-Thought

cs.LG · 2026-05-27 · unverdicted · novelty 8.0

L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.

Lost in Tokenization: Fundamental Trade-offs in Graph Tokenization for Transformers

cs.LG · 2026-05-21 · accept · novelty 8.0

Graph tokenizations for Transformers induce distinct depth regimes with proven separations and impossibility results for converting between them at limited depth.

Agentic Transformers Provably Learn to Search via Reinforcement Learning

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

In a stochastic k-ary tree, a two-head transformer learns randomized DFS via policy gradient under depth-wise curriculum, generalizes to deeper trees, and adapts to imbalanced goals via discounting.

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.

Pretraining Recurrent Networks without Recurrence

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.

A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

Supervised fine-tuning lets LLMs linearly encode action validity and state predicates, with broader state-space coverage during training improving world-model recovery.

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

cs.HC · 2026-05-26 · unverdicted · novelty 6.0

Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.

A Sharper Picture of Generalization in Transformers

cs.LG · 2026-05-20 · unverdicted · novelty 6.0 · 2 refs

PAC-Bayes applied to low-sharpness flat minima yields non-vacuous generalization bounds for boolean functions whose Fourier spectra are sparse and low-degree, with parameters estimable by property testing.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

The two clocks and the innovation window: When and how generative models learn rules

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Neural operators progressively forget domain geometry with depth due to Markovian layers and global mixing; a geometry memory injection mechanism mitigates this forgetting.

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.

The Serial Scaling Hypothesis

cs.LG · 2025-07-16 · unverdicted · novelty 5.0

The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.

There Will Be a Scientific Theory of Deep Learning

stat.ML · 2026-04-23 · unverdicted · novelty 2.0

A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

Learning State-Tracking from Code Using Linear RNNs

cs.LG · 2026-02-16

Scaling Latent Reasoning via Looped Language Models

cs.CL · 2025-10-29

citing papers explorer

Showing 13 of 13 citing papers after filters.

Transformers Provably Learn to Internalize Chain-of-Thought cs.LG · 2026-05-27 · unverdicted · none · ref 28
L-layer transformers under Log-ICoT curriculum provably learn k-parity with poly(n) samples and log k stages, matching explicit CoT efficiency without inference overhead.
Agentic Transformers Provably Learn to Search via Reinforcement Learning cs.LG · 2026-05-29 · unverdicted · none · ref 12
In a stochastic k-ary tree, a two-head transformer learns randomized DFS via policy gradient under depth-wise curriculum, generalizes to deeper trees, and adapts to imbalanced goals via discounting.
Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference cs.CL · 2026-05-25 · unverdicted · none · ref 36
A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.
Pretraining Recurrent Networks without Recurrence cs.LG · 2026-06-04 · unverdicted · none · ref 77
SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.
A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners cs.LG · 2026-06-02 · unverdicted · none · ref 15
Supervised fine-tuning lets LLMs linearly encode action validity and state predicates, with broader state-space coverage during training improving world-model recovery.
A Systematic Study of Behavioral Cloning for Scientific Data Annotation cs.HC · 2026-05-26 · unverdicted · none · ref 71
Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.
A Sharper Picture of Generalization in Transformers cs.LG · 2026-05-20 · unverdicted · none · ref 19 · 2 links
PAC-Bayes applied to low-sharpness flat minima yields non-vacuous generalization bounds for boolean functions whose Fourier spectra are sparse and low-degree, with parameters estimable by property testing.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 89
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
The two clocks and the innovation window: When and how generative models learn rules cs.LG · 2026-05-11 · unverdicted · none · ref 14
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
Do Neural Operators Forget Geometry? The Forgetting Hypothesis in Deep Operator Learning cs.LG · 2026-05-07 · unverdicted · none · ref 9
Neural operators progressively forget domain geometry with depth due to Markovian layers and global mixing; a geometry memory injection mechanism mitigates this forgetting.
The Recurrent Transformer: Greater Effective Depth and Efficient Decoding cs.LG · 2026-04-23 · unverdicted · none · ref 65
Recurrent Transformers add per-layer recurrent memory via self-attention on own activations plus a tiling algorithm that reduces training memory traffic, yielding better C4 pretraining cross-entropy than parameter-matched standard transformers with fewer layers.
The Serial Scaling Hypothesis cs.LG · 2025-07-16 · unverdicted · none · ref 63
The serial scaling hypothesis formalizes inherently serial problems in complexity theory and demonstrates that diffusion models cannot solve them.
There Will Be a Scientific Theory of Deep Learning stat.ML · 2026-04-23 · unverdicted · none · ref 243
A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.

arXiv preprint arXiv:2210.10749 , year=

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer