super hub Canonical reference

Scaling Laws for Neural Language Models

Benjamin Chess, Jared Kaplan, Rewon Child, Sam McCandlish, Tom B Brown, Tom Henighan · 2020 · cs.LG · arXiv 2001.08361

Canonical reference. 83% of citing Pith papers cite this work as background.

695 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 695 citing papers more from Benjamin Chess arXiv PDF

abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 119 method 6 dataset 3 baseline 2 other 2

citation-polarity summary

background 110 unclear 8 use method 6 support 3 use dataset 3 baseline 2

claims ledger

abstract We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are s

authors

Benjamin Chess Jared Kaplan Rewon Child Sam McCandlish Tom B Brown Tom Henighan

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.

The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

econ.GN · 2026-05-19 · unverdicted · novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

cs.LG · 2026-04-03 · unverdicted · novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

cs.LG · 2026-02-18 · unverdicted · novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

cs.LG · 2025-06-12 · unverdicted · novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

cs.LG · 2025-05-30 · unverdicted · novelty 8.0

Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Smooth Scaling Laws Hide Stepwise Token Learning

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.

Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

A compute-aware framework using cumulative FLOPs shows alignment training has non-monotonic effects on robustness and attack costs vary up to 5x across harm categories.

Provable Data Scaling Law for Meta Learning via Complexity Minimization

stat.ML · 2026-06-01 · unverdicted · novelty 7.0

A novel complexity minimization meta-learning framework provably demonstrates that few-shot adaptation error decreases as meta-training data volume increases.

citing papers explorer

Showing 50 of 695 citing papers.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models cs.CL · 2022-01-28 · accept · none · ref 27 · internal anchor
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
An Open-Source Training Dataset for Foundation Models for Black-box Optimization cs.LG · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets econ.GN · 2026-05-19 · unverdicted · none · ref 21 · internal anchor
Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters quant-ph · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
Nearly Optimal Attention Coresets cs.DS · 2026-05-07 · unverdicted · none · ref 27 · internal anchor
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
Efficient Training on Multiple Consumer GPUs with RoundPipe cs.DC · 2026-04-29 · conditional · none · ref 22 · internal anchor
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry cs.LG · 2026-04-03 · unverdicted · none · ref 8 · internal anchor
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking cs.LG · 2026-02-18 · unverdicted · none · ref 2 · internal anchor
Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 3 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods cs.LG · 2025-06-12 · unverdicted · none · ref 7 · internal anchor
Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States cs.LG · 2025-05-30 · unverdicted · none · ref 14 · internal anchor
Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 109 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 29 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States cs.LG · 2024-07-05 · conditional · none · ref 43 · internal anchor
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
KAN: Kolmogorov-Arnold Networks cs.LG · 2024-04-30 · conditional · none · ref 73 · internal anchor
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
TinyStories: How Small Can Language Models Be and Still Speak Coherent English? cs.CL · 2023-05-12 · conditional · none · ref 16 · internal anchor
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 118 · internal anchor
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Discovering Language Model Behaviors with Model-Written Evaluations cs.CL · 2022-12-19 · unverdicted · none · ref 7 · internal anchor
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling cs.CL · 2020-12-31 · conditional · none · ref 112 · internal anchor
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph cs.CL · 2026-06-29 · unverdicted · none · ref 86 · internal anchor
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
Smooth Scaling Laws Hide Stepwise Token Learning cs.CL · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.
Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise cs.LG · 2026-06-28 · unverdicted · none · ref 24 · internal anchor
Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.
Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models cs.LG · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
A compute-aware framework using cumulative FLOPs shows alignment training has non-monotonic effects on robustness and attack costs vary up to 5x across harm categories.
Provable Data Scaling Law for Meta Learning via Complexity Minimization stat.ML · 2026-06-01 · unverdicted · none · ref 73 · internal anchor
A novel complexity minimization meta-learning framework provably demonstrates that few-shot adaptation error decreases as meta-training data volume increases.
LLMs Need Encoders for Semantic IDs Too cs.IR · 2026-05-29 · unverdicted · none · ref 19 · internal anchor
PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.
D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training cs.CL · 2026-05-29 · unverdicted · none · ref 14 · internal anchor
D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.
What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness cs.CV · 2026-05-29 · unverdicted · none · ref 25 · internal anchor
The study links three LVLM architectural dimensions to three hallucination types via a new benchmark, finding that language foundation quality reduces co-occurrence errors, visual encoder strength reduces similarity errors, alignment reduces uncertainty errors, and joint visual-alignment improvement
Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them cs.LG · 2026-05-29 · conditional · none · ref 71 · internal anchor
Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.
YoCausal: How Far is Video Generation from World Model? A Causality Perspective cs.CV · 2026-05-28 · unverdicted · none · ref 56 · internal anchor
YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.
Latent Performance Profiling of Large Language Models cs.CL · 2026-05-28 · unverdicted · none · ref 27 · internal anchor
Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions cs.LG · 2026-05-28 · unverdicted · none · ref 16 · internal anchor
Vendi Score and scaling-law objectives belong to the class of matrix spectral functions, which are submodular, enabling efficient greedy selection of training data that outperforms random subsets in predicting held-out performance.
Neural Scaling Laws for Jet Generation hep-ph · 2026-05-27 · unverdicted · none · ref 5 · internal anchor
Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, but show weaker scaling for dataset size and compute due to rapid saturation.
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm cs.CL · 2026-05-27 · unverdicted · none · ref 58 · internal anchor
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Gradient Transformer: Learning to Generate Updates for LLMs cs.LG · 2026-05-26 · unverdicted · none · ref 10 · internal anchor
Gradient Transformer learns to map TinyLM update vectors to LLM update vectors for data-free knowledge distillation using correlations from shadow datasets.
Neural Autoregressive Control Variates for the Quantum Monte Carlo Sign Problem cond-mat.str-el · 2026-05-26 · unverdicted · none · ref 43 · internal anchor
Paired autoregressive transformers with disjoint sign-sector support generate zero-mean control variates that cut the standard error of the average sign by up to 10x and the energy estimator by 3-5x in small-N SSE simulations of the triangular-lattice Heisenberg antiferromagnet.
FTibSuite: A Comprehensive Resource Suite for Tibetan Vision-Language Modeling cs.CV · 2026-05-26 · unverdicted · none · ref 13 · internal anchor
FTibSuite provides human-verified multimodal corpora, Tibetan-adapted benchmarks with quality controls, and a baseline VLM showing gains on tasks like MMBench while preserving Chinese capabilities.
Blocked Gibbs meets Diffusion Transformers: Unsupervised Learning for Constraint Optimization cs.LG · 2026-05-24 · unverdicted · none · ref 22 · internal anchor
BloGDiT introduces blocked Gibbs-style denoising in diffusion transformers to enable large targeted edits for constraint satisfaction and optimization, matching or exceeding prior methods on Sudoku, graph coloring, MIS, and MaxCut.
Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression cs.IT · 2026-05-24 · unverdicted · none · ref 31 · internal anchor
Under a polynomial context-truncation sensitivity assumption, suffix-only KV cache policies require per-token memory scaling as Θ(ε^{-1/α}) to achieve distortion ε.
From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression cs.LG · 2026-05-23 · unverdicted · none · ref 7 · internal anchor
Derives mini-batch scaling laws for sketched linear regression, with shared approximation terms and protocol-specific variance/fluctuation scalings under power-law spectrum and source condition.
TUBE: Tangent Upper Bound on Evidence for Discrete Diffusion Language Models cs.LG · 2026-05-22 · unverdicted · none · ref 18 · internal anchor
TUBE is a new upper bound on evidence for discrete diffusion models that shows block MDMs and AO-ARMs have strictly lower likelihood than exact ARMs.
A generative pre-trained transformer with Kerr-soliton attention physics.optics · 2026-05-22 · unverdicted · none · ref 5 · internal anchor
Kerr-soliton attention realizes transformer attention in physical hardware via Kerr solitons in a resonator, with analytic training and experimental inference showing high-fidelity agreement between hardware and model.
Tokenisation via Convex Relaxations cs.CL · 2026-05-21 · unverdicted · none · ref 44 · internal anchor
ConvexTok uses convex relaxation of tokenization to a linear program, improving intrinsic metrics, bits-per-byte, and some downstream tasks while certifying near-optimality within 1% at typical vocabulary sizes.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 4 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most cs.AI · 2026-05-21 · unverdicted · none · ref 25 · 2 links · internal anchor
More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
Uniform-in-Time Weak Propagation-of-Chaos in Shallow Neural Networks stat.ML · 2026-05-21 · unverdicted · none · ref 6 · internal anchor
Finite-width shallow networks remain within poly(d) m^{-min(1,c/6)} of their mean-field limit uniformly in time when mean-field excess loss decays as t^{-c} under standard regularity and an integral condition on the loss.
Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training cs.CV · 2026-05-20 · unverdicted · none · ref 23 · internal anchor
AutoScale is a closed-loop data engine using Graph-RAE for scene representation and Cluster-GA for importance-based retrieval to improve real-synthetic co-training for autonomous driving.
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution cs.CV · 2026-05-20 · conditional · none · ref 27 · internal anchor
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
Scalable Reinforcement Learning via Adaptive Batch Scaling stat.ML · 2026-05-20 · unverdicted · none · ref 11 · 2 links · internal anchor
ABS uses Behavioral Divergence to adaptively scale batch sizes in RL according to policy volatility, enabling effective large-batch large-network training on ALE benchmarks.
Provable Joint Decontamination for Benchmarking Multiple Large Language Models cs.LG · 2026-05-20 · unverdicted · none · ref 143 · internal anchor
JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging cs.LG · 2026-05-20 · unverdicted · none · ref 207 · internal anchor
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.

Scaling Laws for Neural Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer