super hub Canonical reference

Scaling Laws for Neural Language Models

Benjamin Chess, Jared Kaplan, Rewon Child, Sam McCandlish, Tom B Brown, Tom Henighan · 2020 · cs.LG · arXiv 2001.08361

Canonical reference. 83% of citing Pith papers cite this work as background.

761 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 761 citing papers more from Benjamin Chess arXiv PDF

abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 119 method 6 dataset 3 baseline 2 other 2

citation-polarity summary

background 110 unclear 8 use method 6 support 3 use dataset 3 baseline 2

claims ledger

abstract We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are s

authors

Benjamin Chess Jared Kaplan Rewon Child Sam McCandlish Tom B Brown Tom Henighan

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.

The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

econ.GN · 2026-05-19 · unverdicted · novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

cs.LG · 2026-04-03 · unverdicted · novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

cs.LG · 2026-02-18 · unverdicted · novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

cs.LG · 2025-06-12 · unverdicted · novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

cs.LG · 2025-05-30 · unverdicted · novelty 8.0

Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Smooth Scaling Laws Hide Stepwise Token Learning

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.

Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.

citing papers explorer

Showing 50 of 261 citing papers after filters.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization cs.LG · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry cs.LG · 2026-04-03 · unverdicted · none · ref 8 · internal anchor
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking cs.LG · 2026-02-18 · unverdicted · none · ref 2 · internal anchor
Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.
Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods cs.LG · 2025-06-12 · unverdicted · none · ref 7 · internal anchor
Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States cs.LG · 2025-05-30 · unverdicted · none · ref 14 · internal anchor
Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States cs.LG · 2024-07-05 · conditional · none · ref 43 · internal anchor
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
KAN: Kolmogorov-Arnold Networks cs.LG · 2024-04-30 · conditional · none · ref 73 · internal anchor
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models cs.LG · 2026-06-30 · unverdicted · none · ref 11 · internal anchor
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise cs.LG · 2026-06-28 · unverdicted · none · ref 24 · internal anchor
Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.
Quantum Generative Diffusion Model for Real-World Time Series cs.LG · 2026-06-25 · unverdicted · none · ref 15 · internal anchor
QDiffusion-TS is the first quantum generative diffusion model for time series, achieving ~44% lower Wasserstein distance on Apple and Amazon stock data and up to 71% better forecasting RMSE with ~1000x fewer parameters than classical diffusion.
Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models cs.LG · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
A compute-aware framework using cumulative FLOPs shows alignment training has non-monotonic effects on robustness and attack costs vary up to 5x across harm categories.
Scaling Laws for Behavioral Foundation Models over User Event Sequences cs.LG · 2026-06-03 · unverdicted · none · ref 1 · internal anchor
Across 600 runs from 10^15 to 10^19 FLOPs, behavioral models show a 2% embedder is compute-optimal at all scales, training is data-heavy at low compute, and optimal negatives increase with budget until memory-limited.
Spectral Scaling Laws of Muon cs.LG · 2026-06-02 · unverdicted · none · ref 6 · internal anchor
Muon momentum matrices show layer-dependent power-law scaling of stabilized singular value quantiles with model size from 77M to 2.8B parameters.
Unlocking Feature Learning in Gated Delta Networks at Scale cs.LG · 2026-06-02 · unverdicted · none · ref 3 · internal anchor
Derives μP-style scaling rules for Gated Delta Networks and validates stable learning-rate transfer in language model pre-training experiments.
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 71 · internal anchor
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions cs.LG · 2026-05-28 · unverdicted · none · ref 16 · internal anchor
Vendi Score and scaling-law objectives belong to the class of matrix spectral functions, which are submodular, enabling efficient greedy selection of training data that outperforms random subsets in predicting held-out performance.
Gradient Transformer: Learning to Generate Updates for LLMs cs.LG · 2026-05-26 · unverdicted · none · ref 10 · internal anchor
Gradient Transformer learns to map TinyLM update vectors to LLM update vectors for data-free knowledge distillation using correlations from shadow datasets.
Blocked Gibbs meets Diffusion Transformers: Unsupervised Learning for Constraint Optimization cs.LG · 2026-05-24 · unverdicted · none · ref 22 · internal anchor
BloGDiT introduces blocked Gibbs-style denoising in diffusion transformers to enable large targeted edits for constraint satisfaction and optimization, matching or exceeding prior methods on Sudoku, graph coloring, MIS, and MaxCut.
From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression cs.LG · 2026-05-23 · unverdicted · none · ref 7 · internal anchor
Derives mini-batch scaling laws for sketched linear regression, with shared approximation terms and protocol-specific variance/fluctuation scalings under power-law spectrum and source condition.
TUBE: Tangent Upper Bound on Evidence for Discrete Diffusion Language Models cs.LG · 2026-05-22 · unverdicted · none · ref 18 · internal anchor
TUBE is a new upper bound on evidence for discrete diffusion models that shows block MDMs and AO-ARMs have strictly lower likelihood than exact ARMs.
Provable Joint Decontamination for Benchmarking Multiple Large Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 143 · internal anchor
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging cs.LG · 2026-05-20 · unverdicted · none · ref 207 · internal anchor
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks cs.LG · 2026-05-18 · unverdicted · none · ref 7 · 2 links · internal anchor
PMF-CL derives Pareto-minimal-forgetting algorithms for linear/basis-function regression and quadratic-bounded losses like logistic regression, achieving static O(d²) memory for d-parameter models.
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method cs.LG · 2026-05-18 · unverdicted · none · ref 206 · internal anchor
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment cs.LG · 2026-05-17 · unverdicted · none · ref 39 · internal anchor
PEIRA learns predictive encoders by optimizing the trace of the optimal inter-view linear regressor, with only nontrivial global minimizers as stable equilibria that recover leading nonlinear canonical correlation subspaces.
Olivia: Harmonizing Time Series Foundation Models with Power Spectral Density cs.LG · 2026-05-17 · unverdicted · none · ref 9 · internal anchor
Olivia harmonizes time series datasets via normalized power spectral density using a Harmonizer module and resonator-based HarmonicAttention, achieving state-of-the-art zero-shot, few-shot, and full-shot forecasting on TSLib, GIFT-Eval, and GluonTS benchmarks.
Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis cs.LG · 2026-05-15 · unverdicted · none · ref 65 · internal anchor
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization cs.LG · 2026-05-13 · unverdicted · none · ref 93 · internal anchor
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, and sparsity.
Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling cs.LG · 2026-05-13 · unverdicted · none · ref 1 · 2 links · internal anchor
Language models show a scale-dependent switch from anticorrelated to correlated reasoning-truthfulness coupling at a family-specific critical parameter count, with architecture and data choices shifting the transition point.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI cs.LG · 2026-05-09 · unverdicted · none · ref 44 · 2 links · internal anchor
MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards cs.LG · 2026-05-08 · unverdicted · none · ref 16 · internal anchor
DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.
On the Invariance and Generality of Neural Scaling Laws cs.LG · 2026-05-08 · unverdicted · none · ref 30 · internal anchor
Neural scaling laws are invariant under bijective data transformations and change predictably with information resolution ρ under non-bijective transformations, enabling cross-domain transport of fitted exponents.
Adaptive Selection of LoRA Components in Privacy-Preserving Federated Learning cs.LG · 2026-05-07 · unverdicted · none · ref 62 · internal anchor
AS-LoRA adaptively chooses which LoRA factor to update per layer and round using a curvature-aware second-order score, eliminating reconstruction error floors and improving performance in DP federated learning.
The Predictive-Causal Gap: An Impossibility Theorem and Large-Scale Neural Evidence cs.LG · 2026-05-06 · unverdicted · none · ref 10 · internal anchor
Predictive representation learning structurally favors encoding slower or less noisy environment modes over causal system modes, as shown by an impossibility theorem for linear-Gaussian dynamics and large-scale neural experiments.
Budgeted LoRA: Distillation as Structured Compute Allocation for Efficient Inference cs.LG · 2026-05-05 · unverdicted · none · ref 10 · internal anchor
Budgeted LoRA treats LLM distillation as structured compute allocation under a single global budget, producing student models with tunable inference speedups of 1.74x to 4.05x while controlling perplexity and task accuracy.
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs cs.LG · 2026-05-01 · unverdicted · none · ref 25 · internal anchor
RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.
Low Rank Adaptation for Adversarial Perturbation cs.LG · 2026-04-30 · unverdicted · none · ref 83 · internal anchor
Adversarial perturbations possess an inherently low-rank structure that enables more efficient and effective black-box adversarial attacks via subspace projection.
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity cs.LG · 2026-04-27 · unverdicted · none · ref 10 · internal anchor
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models cs.LG · 2026-04-22 · unverdicted · none · ref 16 · internal anchor
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling cs.LG · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
Rethinking Dataset Distillation: Hard Truths about Soft Labels cs.LG · 2026-04-20 · conditional · none · ref 18 · internal anchor
Soft labels hide the value of high-quality data subsets in dataset distillation, and a new compute-aware method outperforms existing approaches in hard-label settings on ImageNet-1K.
Discrete Prototypical Memories for Federated Time Series Foundation Models cs.LG · 2026-04-06 · unverdicted · none · ref 12 · internal anchor
FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.
The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure cs.LG · 2026-02-19 · unverdicted · none · ref 3 · internal anchor
Multi-task grokking in Transformers produces staggered generalization, low-dimensional manifolds, weight-decay phase structure, holographic solutions, and transverse redundancy.
Imposing Boundary Conditions on Neural Operators via Learned Function Extensions cs.LG · 2026-02-04 · unverdicted · none · ref 14 · internal anchor
A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights cs.LG · 2026-02-01 · unverdicted · none · ref 7 · internal anchor
MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads cs.LG · 2026-01-29 · unverdicted · none · ref 4 · internal anchor
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
TS-Arena -- A Live Forecast Pre-Registration Platform cs.LG · 2025-12-23 · conditional · none · ref 18 · internal anchor
TS-Arena is a live pre-registration platform that evaluates time series forecasts on future data streams to eliminate information leakage.
The Art of Scaling Reinforcement Learning Compute for LLMs cs.LG · 2025-10-15 · unverdicted · none · ref 8 · internal anchor
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.
Less is More: Recursive Reasoning with Tiny Networks cs.LG · 2025-10-06 · unverdicted · none · ref 8 · internal anchor
TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
Pre-trained Large Language Models Learn Hidden Markov Models In-context cs.LG · 2025-06-08 · unverdicted · none · ref 25 · internal anchor
Pre-trained LLMs learn to predict HMM-generated sequences via in-context learning, approaching theoretical optimum on synthetic HMMs and matching expert models on real animal decision data.

Scaling Laws for Neural Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer