{"total":80,"items":[{"citing_arxiv_id":"2606.02211","ref_index":137,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Consistency Training while Mitigating Obfuscation via Rate Matching","primary_cat":"cs.CL","submitted_at":"2026-06-01T13:10:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01269","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Emergent Ordinal Geometry in Transformers Trained on Local Comparisons","primary_cat":"cs.AI","submitted_at":"2026-05-31T14:44:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformers trained on local comparisons spontaneously form a rank-aligned one-dimensional embedding manifold that reproduces the symbolic distance effect.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00243","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dynamics and Representation Structure of Local Approximations to Gradient-Based Learning in Linear Recurrent Neural Networks","primary_cat":"cs.NE","submitted_at":"2026-05-29T18:19:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RFLO learning restricts solutions to low-rank perturbations of initial parameters in linear RNNs and produces qualitatively different stability and convergence behavior than 
BPTT.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00230","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization","primary_cat":"cs.LG","submitted_at":"2026-05-29T18:04:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An exposure-based split on BLiMP data reveals delayed generalization in five grammatical phenomena during LLM pre-training, with post-generalization shifts in concept vector predictiveness and attention patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29823","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Quantifying and Optimizing Simplicity via Polynomial Representations","primary_cat":"cs.AI","submitted_at":"2026-05-28T12:05:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Polynomial representations yield an effective-degree simplicity metric that predicts generalization across tasks and serves as a differentiable regularizer improving performance in classification and RL.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29548","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention","primary_cat":"cs.LG","submitted_at":"2026-05-28T08:02:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task 
features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28986","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Comparing Classical Simulation and Sample-Based Learning of Quantum Systems: Learning the Hardness of Quantum Systems from Samples","primary_cat":"quant-ph","submitted_at":"2026-05-27T18:44:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical study finds neural-network learning difficulty (via Hessian eigenvalue and random subspace optimization) correlates with classical simulation hardness parameterized by MPS bond dimension and T-gate count.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23565","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Goal Generalisation in Sequential Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:31:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22579","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Temperature: Hyperfitting as a Late-Stage Geometric 
Expansion","primary_cat":"cs.CL","submitted_at":"2026-05-21T14:52:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Hyperfitting improves LLM generation via context-dependent rank reordering from geometric expansion in the terminal transformer block, distinct from temperature scaling, and enables efficient Late-Stage LoRA fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20534","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Axiomatizing Neural Networks via Pursuit of Subspaces","primary_cat":"cs.LG","submitted_at":"2026-05-19T22:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Authors introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic geometric framework that unifies explanations for representation, computation, and generalization in shallow and deep neural networks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20441","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics","primary_cat":"cs.LG","submitted_at":"2026-05-19T19:48:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18180","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Canonical Regularisation of Wide 
Feature-Learning Neural Networks","primary_cat":"stat.ML","submitted_at":"2026-05-18T10:23:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18022","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise","primary_cat":"cs.LG","submitted_at":"2026-05-18T08:12:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Experiments on modular arithmetic with heavy label noise show that over-parameterized networks form a distributed internal generalization structure that can be extracted via frequency methods to achieve high accuracy despite 80% noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17767","ref_index":198,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Feature Learning in Linear-Width Two-Layer Networks: Two vs. 
One Step of Gradient Descent","primary_cat":"stat.ML","submitted_at":"2026-05-18T02:37:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with floor(alpha2/(1/2-alpha1)) outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15340","ref_index":64,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bounded-Rationality, Hedging, and Generalization","primary_cat":"cs.LG","submitted_at":"2026-05-14T19:07:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Generalization is a testable hedging property of the learner's response law, recovered via f-divergence regularizers that induce information-geometric curves between training loss and sample dependence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18847","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Transformers Linearly Represent Highly Structured World Models","primary_cat":"cs.LG","submitted_at":"2026-05-13T07:59:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformers trained on Sudoku traces develop constraint-structured internal world models and a monosemantic naked-single circuit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12394","ref_index":22,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Detecting overfitting in Neural Networks during 
long-horizon grokking using Random Matrix Theory","primary_cat":"cs.LG","submitted_at":"2026-05-12T16:57:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Random Matrix Theory detects overfitting via growing Correlation Traps in weight spectra during the anti-grokking phase of neural network training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Because such histories are usually unavailable for open-weight checkpoints, we first study a setting where the relevant dynamics are visible and overfitting can be readily induced with long-horizon training: grokking. In grokking, training accuracy reaches near perfection while test accuracy stays near chance for many optimization steps, before abruptly improving [22]. Grokking has been studied across several architectures and tasks, including algorithmic tasks such as modular addition, computer vision models, and GPT-style transformers [17, 24]. We extend this view to the long-horizon after grokking, where long-horizon training can drive the model into a classical overfitting phase. 
We call this post-generalization regime anti-grokking."},{"citing_arxiv_id":"2605.12199","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Overtrained, Not Misaligned","primary_cat":"cs.LG","submitted_at":"2026-05-12T14:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11850","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Constrained Stochastic Spectral Preconditioning Converges for Nonconvex Objectives","primary_cat":"math.OC","submitted_at":"2026-05-12T09:36:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proximal stochastic spectral preconditioning converges for nonconvex constrained objectives under heavy-tailed noise, with a variance-reduced version achieving faster rates and a refined analysis of Muon iterations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Xie, M. Erdogan, K. Antonakopoulos, A. Silveti-Falls, and V. Cevher. \"Generalized Gradient Norm Clipping & Non-Euclidean (L0, L1)-Smoothness\". In: Advances in Neural Information Processing Systems. 2025, pp. 21170-21208. [53] I. Pinelis. \"Multidimensional probability inequalities via spherical symmetry\". In: arXiv preprint arXiv:2210.04391 (2022). [54] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. \"Grokking: Generalization beyond overfitting on small algorithmic datasets\". In: arXiv preprint arXiv:2201.02177 (2022). [55] X. Qian, H. Rammal, D. Kovalev, and P. Richtarik. 
\"Muon is Provably Faster with Momentum Variance Reduction\". In: arXiv preprint arXiv:2512.16598 (2025). [56] A. Riabinin, E."},{"citing_arxiv_id":"2605.10237","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently","primary_cat":"cs.LG","submitted_at":"2026-05-11T09:11:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10019","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The two clocks and the innovation window: When and how generative models learn rules","primary_cat":"cs.LG","submitted_at":"2026-05-11T05:44:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"An analytic theory of creativity in convolutional diffusion models, 2024. URL https://arxiv.org/abs/2412.20292. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022. Michael Kearns. Efficient noise-tolerant learning from statistical queries. J. ACM, 45(6):983-1006, November 1998. ISSN 0004-5411. doi: 10.1145/293347.293351. URL https://doi.org/10.1145/293347.293351. Juno Kim and Taiji Suzuki. 
Transformers provably solve parity efficiently with chain of thought. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL https://openreview."},{"citing_arxiv_id":"2605.09724","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds","primary_cat":"cs.LG","submitted_at":"2026-05-10T19:47:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"We then compute the total memorisation $M_T(\\theta^*(D_{\\mathrm{rand}}); D_{\\mathrm{rand}})$ via Eq. (2). As we increase $n$, we obtain a capacity curve $M_T(n)$ that initially grows roughly linearly with $n$ (each new datapoint can be memorised) and eventually saturates when the parameters are 'full'. Accordingly, for a given architecture $\\Theta$ with $P$ parameters we define its capacity as $\\widehat{\\mathrm{Cap}}(\\Theta) := \\lim_{n \\to \\infty} M_T(n)$ (3), approximated in practice by the plateau of the empirical capacity curve. Repeating across a range of model sizes with parameter counts $P_1, \\ldots, P_K$ yields pairs $(P_k, \\widehat{\\mathrm{Cap}}_k)$. We then fit a linear model $\\widehat{\\mathrm{Cap}}_k \\approx C_{\\mathrm{model}} P_k + b$ (4), where $C_{\\mathrm{model}}$ (bits per parameter) is our empirical capacity constant and $b$ is close to zero. 
In Section 5.1 we show that for our Transformer family the fit is strongly linear with a small intercept,"},{"citing_arxiv_id":"2605.09345","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features","primary_cat":"cs.LG","submitted_at":"2026-05-10T05:45:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"All rank-monotone pruning scorers converge to identical accuracy at fixed sparsity, but non-monotone features with sparsity-dependent complexity can escape this plateau, as shown by the SICS hypothesis on ViT-Small/CIFAR-10.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"deliberately constructed so that complexity is the controlled variable; our claims are about the SICS structure of the resulting accuracy landscape, not about any specific dynamical system as a method per se. Phase transitions and information-theoretic perspectives in deep learning. Phase transitions appear in many deep learning phenomena: double descent [1, 16], grokking [19], scaling laws [9, 11]. The information-bottleneck literature [22, 23] similarly identifies regime transitions in representation learning. SICS contributes a discrete three-regime transition for one-shot pruning, with the regime determined by sparsity rather than training dynamics. 
Chaos and dynamical systems in machine learning. Connections between chaos and neural"},{"citing_arxiv_id":"2605.09031","ref_index":79,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models","primary_cat":"cs.LG","submitted_at":"2026-05-09T16:15:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"In the high-dimensional limit the spherical Boltzmann machine admits exact equations for training dynamics, Bayesian evidence, and cascades of phase transitions tied to mode alignment with data, which connect to generative phenomena including double descent and out-of-equilibrium biases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"$\\check{M}_P(z) \\equiv \\int_0^\\infty e^{-z\\tau} M_P(\\tau)\\,d\\tau$, so the signal equation reduces to $z + \\kappa_P - \\check{M}_P(z) = \\nu c_a / \\gamma$ (77). The Laplace transform of the uncondensed-bath response equation $\\partial_\\tau R_P(\\tau) = -\\kappa_P R_P(\\tau) + \\int_0^\\tau M_P(\\tau - \\sigma) R_P(\\sigma)\\,d\\sigma$ with $R_P(0) = 1$ gives $\\check{R}_P(z) = 1 / (z + \\kappa_P - \\check{M}_P(z))$ (78). Substituting (78) into (77) yields the spectral equation $1 = (\\nu c_a / \\gamma) \\int_0^\\infty e^{-z\\tau} R_P(\\tau)\\,d\\tau$ (79), whose marginal solution $z = 0$ isolates the static condensation threshold of mode $a$, $\\nu_{c,a} = \\gamma / (c_a \\chi_P)$, with $\\chi_P = \\int_0^\\infty R_P(\\tau)\\,d\\tau$. Since $c_1 > \\cdots > c_K$, mode 1 destabilizes first, fixing the dynamical phase boundary at $\\nu_c = \\gamma / (c_1 \\chi_P)$. 
Numerically, the two-time system (11)-(13) is solved by causal row-by-row time marching with an implicit predictor-corrector at each new time, enforcing $Q(t_n, t_n) = 1$ through $\\kappa(t_n)$."},{"citing_arxiv_id":"2605.08464","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Geometric Structure of Models Learning Sparse Data","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:30:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Normal alignment is the rank-one Jacobian structure that lets classifiers minimize loss and maximize local robustness in sparse regimes; the paper proves its optimality and uses it to create GrokAlign and RFAMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"concept of a continuous underlying data manifold remains entirely inapplicable, even under infinite data assumptions. In this paper, we show that the success of machine learning models in these sparse settings can be similarly attributed to the exploitation of low-dimensional structures. Since the phenomenon of grokking - performance on the train set saturating well before performance on a test set saturates [11] - is a canonical example of a sparse setting, we use these insights to introduce the GrokAlign strategy for accelerating grokking training dynamics. 
Similarly, as tabular data can be described as sparse, we introduce Recursive Feature Alignment Machines (RFAMs) to improve the robustness of Recursive Feature Machines (RFMs) [12] when trained on tabular data."},{"citing_arxiv_id":"2605.07648","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning Large-Scale Modular Addition with an Auxiliary Modulus","primary_cat":"cs.LG","submitted_at":"2026-05-08T12:16:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An auxiliary modulus during training reduces wrap-around issues and preserves train-test input distributions, enabling better accuracy and sample efficiency for large N and q in modular addition learning.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW. [22] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022. URL https://arxiv.org/abs/2201.02177. [23] Sebastian Ruder. An overview of multi-task learning in deep neural networks, 2017. URL https://arxiv.org/abs/1706.05098. [24] Eshika Saxena, Alberto Alfarano, Emily Wenger, and Kristin E. Lauter. Making hard problems easier with custom data distributions and loss regularization: A case study in modular arithmetic. 
In Forty-second International Conference on Machine Learning, 2025."},{"citing_arxiv_id":"2605.06352","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Topological Signatures of Grokking","primary_cat":"cs.LG","submitted_at":"2026-05-07T14:33:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Persistent homology detects a sharp increase in maximum and total H1 persistence during grokking on modular arithmetic, offering a topological diagnostic that links representation geometry to generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06258","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Weight Gram Matrix Captures Sequential Feature Linearization in Deep Networks","primary_cat":"cs.LG","submitted_at":"2026-05-07T13:35:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gradient descent in deep networks implicitly drives features toward target-linear structure as captured by the weight Gram matrix and a derived virtual covariance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06152","ref_index":2,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes","primary_cat":"cs.LG","submitted_at":"2026-05-07T12:45:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Loss spikes are a persistent puzzle in neural network training, creating difficulties for both theoretical understanding and practical stability. 
One representative example is the Slingshot Mechanism, first observed in the study of grokking under no explicit regularization. Grokking refers to the phenomenon where neural networks achieve sudden generalization long after reaching perfect training accuracy [2]. In such settings, training is often accompanied by periodic instabilities: the norm of the last-layer parameters grows rapidly, often close to exponentially, and is followed by an abrupt training loss spike. Existing work has mainly interpreted Slingshot as an intrinsic optimization phenomenon. For example, Thilak et al. [1] related it to the Edge of Stability (EOS) [3], where the optimizer periodically crosses"},{"citing_arxiv_id":"2605.05683","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization","primary_cat":"stat.ML","submitted_at":"2026-05-07T05:19:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"A tale of two circuits: Grokking as competition of sparse and dense subnetworks, 2023. URL https://arxiv.org/abs/2303.11873. [42] Pascal Jr. Tikeng Notsawo, Hattie Zhou, Mohammad Pezeshki, Irina Rish, and Guillaume Dumas. Predicting grokking long before it happens: A look into the loss landscape of models which grok, 2023. URL https://arxiv.org/abs/2306.13253. [43] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners, 2019. 
URL https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf. [44] KellerJordan/modded-nanogpt contributors."},{"citing_arxiv_id":"2605.05436","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Estimating Implicit Regularization in Deep Learning","primary_cat":"stat.ML","submitted_at":"2026-05-06T20:52:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Bias: On the Role of Implicit Regularization in Deep Learning, April 2015. URL http://arxiv.org/abs/1412.6614. arXiv:1412.6614 [cs]. [31] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, January 2022. URL http://arxiv.org/abs/2201.02177. arXiv:2201.02177 [cs]. [32] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the Spectral Bias of Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, pages 5301-5310. PMLR, May 2019. URL https://proceedings.mlr.press/v97/rahaman19a.html. [33] Reginaldo J. 
Santos."},{"citing_arxiv_id":"2605.04396","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize","primary_cat":"cs.LG","submitted_at":"2026-05-06T01:39:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformers show a sharp, task-specific critical window for weight decay application that determines reasoning versus memorization, with middle placement optimal and boundaries as narrow as 100 steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04230","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Layerwise LQR for Geometry-Aware Optimization of Deep Networks","primary_cat":"cs.LG","submitted_at":"2026-05-05T19:16:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Steepest descent under divergence-induced quadratic models equals an LQR problem, enabling learning of diagonal or Kronecker-factored inverse preconditioners via a global layerwise objective for scalable geometry-aware training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04024","ref_index":77,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can Transformers predict system collapse in dynamical systems?","primary_cat":"nlin.CD","submitted_at":"2026-05-05T17:48:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformers fail to predict catastrophic collapse in unseen parameter regimes of nonlinear dynamical systems, while reservoir computing reliably 
succeeds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16325","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry","primary_cat":"cs.LG","submitted_at":"2026-05-05T05:33:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes a two-gradient-field model with candidate order parameters alpha_dagger and kappa_c to unify phase transitions across learning theory and non-equilibrium chemistry.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02968","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency","primary_cat":"cs.LG","submitted_at":"2026-05-03T12:21:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01420","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance","primary_cat":"cs.AI","submitted_at":"2026-05-02T12:37:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AJI frames jagged AI 
capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01172","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Theory of Generalization in Deep Learning","primary_cat":"cs.LG","submitted_at":"2026-05-02T00:21:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A theory shows SGD accumulates coherent signal via linear drift in NTK signal directions while trapping noise in orthogonal low-eigenvalue dimensions, enabling generalization even under O(1) kernel evolution and yielding an exact population-risk objective from one run that acts as an Adam SNR boost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08119","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Feature Repulsion and Spectral Lock-in: An Empirical Study of Two-Layer Network Grokking","primary_cat":"cs.LG","submitted_at":"2026-04-28T03:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Empirical tests confirm robust feature repulsion signs but reveal activation-dependent spectral lock-in in grokking, with x^2 yielding rank-2 updates at epoch ~174 and ReLU remaining rank-1.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25143","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer 
Trajectories","primary_cat":"cs.LG","submitted_at":"2026-04-28T02:44:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20817","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Convergent Evolution: How Different Language Models Learn Similar Number Representations","primary_cat":"cs.CL","submitted_at":"2026-04-22T17:45:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diverse language models converge on similar periodic number features with a two-tier hierarchy of Fourier sparsity and geometric separability, acquired via language co-occurrences or multi-token arithmetic.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20923","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ILDR: Geometric Early Detection of Grokking","primary_cat":"cs.LG","submitted_at":"2026-04-22T06:14:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ILDR detects the geometric reorganization preceding grokking by measuring when inter-class centroid separation exceeds intra-class scatter by 2.5 times its baseline in penultimate-layer representations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19740","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Generalization at the Edge of 
Stability","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17673","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Grokking of Diffusion Models: Case Study on Modular Addition","primary_cat":"cs.LG","submitted_at":"2026-04-20T00:02:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13123","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spectral Entropy Collapse as a Phase Transition in Delayed Generalisation: An Interventional and Predictive Framework for Grokkin","primary_cat":"cs.LG","submitted_at":"2026-04-13T18:23:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spectral entropy collapse in learned representations precedes and predicts grokking, with interventions showing it is not explained by parameter norm 
alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09258","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima","primary_cat":"cs.LG","submitted_at":"2026-04-10T12:17:18+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, and Hannaneh Hajishirzi. 2 olmo 2 furious, 2025. [31] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. [32] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets.arXiv preprint arXiv:2201.02177, 2022. [33] Mihir Prabhudesai, Mengning Wu, Amir Zadeh, Katerina Fragkiadaki, and Deepak Pathak. 
Diffusion beats autoregressive in data-constrained settings.arXiv preprint arXiv:2507."},{"citing_arxiv_id":"2604.08358","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation","primary_cat":"quant-ph","submitted_at":"2026-04-09T15:21:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Neural decoder for quantum LDPC codes achieves ~10^{-10} logical error at 0.1% physical error with 17x improvement and high throughput, enabling practical fault tolerance at modest code sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07962","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Is your algorithm unlearning or untraining?","primary_cat":"cs.LG","submitted_at":"2026-04-09T08:24:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Machine unlearning conflates reversing the influence of specific training examples (untraining) with removing the full underlying distribution or behavior (unlearning).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09716","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach","primary_cat":"cs.CV","submitted_at":"2026-04-08T12:41:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces integration, metastability, and dynamical stability index measures from layer activations and reports patterns distinguishing CIFAR-10 from CIFAR-100 difficulty plus early convergence signals across ResNet variants, DenseNet, 
MobileNetV2, VGG-16, and a Vision Transformer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06256","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Spectral Edge Dynamics Reveal Functional Modes of Learning","primary_cat":"cs.LG","submitted_at":"2026-04-06T22:29:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Spectral edge dynamics during grokking reveal task-dependent low-dimensional functional modes over inputs, such as Fourier modes for modular addition and cross-term decompositions for x squared plus y squared.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}