{"total":12,"items":[{"citing_arxiv_id":"2607.00913","ref_index":32,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Two AI Metrics Diverged: Will it Make All the Difference?","primary_cat":"cs.AI","submitted_at":"2026-07-01T13:18:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Bounded performance metrics always favor convergence of AI capabilities to meek models while unbounded metrics allow frontier models to maintain leads indefinitely, with policy implications for capability concentration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24998","ref_index":48,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Internal Data Repetition Destroys Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-23T16:02:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repetition of training data produces a systematic eval loss peak at intermediate repeat counts whose location scales with model size, quantifiable as large compute-equivalent loss even at modest repetition fractions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20299","ref_index":200,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Statistical Properties of Training & Generalization","primary_cat":"stat.ML","submitted_at":"2026-06-18T14:35:53+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19781","ref_index":29,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Engineering Scaling Laws with Pretraining Data Composition","primary_cat":"hep-ex","submitted_at":"2026-06-18T04:32:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pretraining data composition can be used to engineer neural scaling laws in hadronic jet classification toward data-heavy rather than model-size-heavy regimes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08167","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Explaining Data Mixing Scaling Laws","primary_cat":"cs.LG","submitted_at":"2026-06-06T13:31:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A framework using capacity competition and noise reduction under an overlapping-skills assumption explains multi-domain loss behaviors and extrapolates optimal mixtures to large scales from small-scale fits with fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09189","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Practical Scaling Laws: Converting Compute into Performance in a Data-Constrained World","primary_cat":"cs.LG","submitted_at":"2026-05-09T22:07:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new scaling law L(N, D, T) = E + (L0 - E) h/(1+h) with h = a/N^α + b/T^β + c N^γ/D^δ that decomposes loss into undercapacity, undertraining, and overfitting terms and saturates between E and L0.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"predict loss at scales practitioners cannot afford to run. The most-cited instance is the Chinchilla form [26]: L(N, D) =E+ A N α + B Dβ .(1) Preprint. arXiv:2605.09189v1 [cs.LG] 9 May 2026 Variants of this power-law-plus-constant template recur across language modeling [27], vision [1], transfer learning [23], downstream tasks [12, 16], and theoretical models [5, 34, 10], and inherit its structural commitments. Three structural failures.Though Equation (1) and its close relatives have been used to fit training runs across many orders of magnitude in the data-rich, single-epoch regime, these forms lack three crucial properties that prevent them from generalizing beyond this regime: • No baseline saturation."},{"citing_arxiv_id":"2605.09154","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Predicting Large Model Test Losses with a Noisy Quadratic System","primary_cat":"cs.LG","submitted_at":"2026-05-09T20:35:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A noisy quadratic system predicts large model test losses from N, B, K and outperforms Chinchilla's model for extrapolation up to 1000x compute.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"nq )2(K−k) ξk is a weighted average of indept.ξ k's. This gives usE h (w(K) n −w ∗ n)(w∗ n −w (0) n ) i =E h (Fbias(w(0) −w ∗) +F var)(w∗ n −w (0) n ) i (35) =E h −Fbias(w(0) −w ∗)2 i +E h Fvar(w(0) −w ∗) i (36) =E h −Fbias(w(0) −w ∗)2 i +E[F var]E h (w(0) −w ∗) i (37) =E h −Fbias(w(0) −w ∗)2 i + 0×E h (w(0) −w ∗) i (38) =−F biasE h (w(0) −w ∗)2 i .(39) 21 Predicting Large Model Test Losses with a Noisy Quadratic System The second last line makes use of the zero expectation and independence ofξ k's. ThereforeE h (w(K) n −w (0) n )2 i =E h (w(K) n −w ∗ n) + (w∗ n −w (0) n ) \u00012i (40) =E h (w(K) n −w ∗ n)2 i +E h 2(w(K) n −w ∗ n)(w∗ n −w (0) n ) i +E h (w∗ n −w (0) n )2 i (41) =E h (w(K) n −w ∗ n)2"},{"citing_arxiv_id":"2605.05683","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Spectral Lens: Activation and Gradient Spectra as Diagnostics of LLM Optimization","primary_cat":"stat.ML","submitted_at":"2026-05-07T05:19:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Spectral analysis of activations and gradients provides new diagnostics that link batch size to representation geometry, early covariance tails to token efficiency, and spectral shifts to learning dynamics in decoder-only LLMs, backed by a mechanistic model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"small-scale models transfer predictably to larger tiers, providing a stable protocol for forecasting dynamics as models grow in depth and parameter count. Our empirical study proceeds in two stages: a systematic scan of decoder-only models to identify transferable spectral signatures, followed by a toy-model analysis for mechanistic grounding. Using the modded-NanoGPT codebase [10], we generate a controlled intervention chain and evaluate diagnostics across layer tiers. Fig. 1 previews these results: we identify internal geometries that diverge despite matched loss (Fig. 1a), demonstrate that early-training signatures identified in 12-layer runs reliably predict the token-efficiency of 36- and 48-layer models up to 3.57B parameters"},{"citing_arxiv_id":"2604.21691","ref_index":107,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"There Will Be a Scientific Theory of Deep Learning","primary_cat":"stat.ML","submitted_at":"2026-04-23T13:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universal behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09220","ref_index":28,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TinyNeRV: Compact Neural Video Representations via Capacity Scaling, Distillation, and Low-Precision Inference","primary_cat":"cs.CV","submitted_at":"2026-04-10T11:26:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Tiny NeRV models using capacity scaling, frequency-aware distillation, and low-precision quantization achieve favorable quality-efficiency trade-offs with far fewer parameters and lower computational costs than standard NeRV.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: Proceedings of the International Conference on Machine Learning, 2019. [27] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. [28] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, U. Sharma, Explaining neural scaling laws, Proceedings of the National Academy of Sciences 121 (27) (2024) e2311878121.arXiv:https://www.pnas.org/doi/pdf/10.1073/pnas.2311878121,doi:10.1073/pnas.2311878121. URLhttps://www.pnas.org/doi/abs/10.1073/pnas.2311878121 [29] A. Sengupta, Y. Goel, T. Chakraborty, How to upscale neural networks with scaling law?"},{"citing_arxiv_id":"2602.13298","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs","primary_cat":"cs.CV","submitted_at":"2026-02-09T10:14:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Effective depth, an operational count of sequential transformations, predicts CNN trainability better than nominal layer count because shortcuts and branches decouple the two.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.00592","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Scaling and renormalization in high-dimensional regression","primary_cat":"stat.ML","submitted_at":"2024-05-01T15:59:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ridge regression in high dimensions exhibits power-law scalings because covariance fluctuations renormalize the ridge parameter, allowing closed-form error expressions and bias-variance decompositions for random feature models via free probability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}