Scaling Laws for Neural Language Models

Benjamin Chess, Jared Kaplan, Rewon Child, Sam McCandlish, Tom B Brown, Tom Henighan · 2020 · cs.LG · arXiv 2001.08361

Canonical reference. 84% of citing Pith papers cite this work as background.

876 Pith papers citing it

abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
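The power-law dependence described in the abstract can be illustrated with a short sketch. A power law of the form loss = a · size^(−alpha) is a straight line in log-log space, so its exponent can be recovered with an ordinary least-squares fit on the logarithms. The function name `fit_power_law` and the constants `TRUE_A` and `TRUE_ALPHA` are assumptions made up for this illustration; they are not values or code from the paper, and the data below is synthetic.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by least squares in log-log space.

    Returns the pair (a, alpha).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Closed-form simple linear regression: log(loss) = intercept + slope * log(size)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, noise-free data generated from hypothetical constants
# (illustration only; these are not measurements from the paper).
TRUE_A = 20.0
TRUE_ALPHA = 0.08
sizes = [10 ** k for k in range(5, 10)]  # model sizes from 1e5 to 1e9
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in sizes]

a_hat, alpha_hat = fit_power_law(sizes, losses)
print(f"fitted prefactor a = {a_hat:.4f}, fitted exponent alpha = {alpha_hat:.4f}")
```

On real training runs the points would be noisy and the fit would only hold over the range where the trend is actually a power law, so the recovered exponent should be read as an empirical estimate rather than an exact constant.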

citation-role summary

background 121 · method 6 · dataset 3 · baseline 2 · other 2

citation-polarity summary

background 112 · unclear 8 · use method 6 · support 3 · use dataset 3 · baseline 2

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.

The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

econ.GN · 2026-05-19 · unverdicted · novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

cs.LG · 2026-04-03 · unverdicted · novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

cs.LG · 2026-02-18 · unverdicted · novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

cs.LG · 2025-06-12 · unverdicted · novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

cs.LG · 2025-05-30 · unverdicted · novelty 8.0

Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Smooth Scaling Laws Hide Stepwise Token Learning

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.

citing papers explorer

Showing 50 of 876 citing papers.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models cs.CL · 2022-01-28 · accept · none · ref 27 · internal anchor
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
An Open-Source Training Dataset for Foundation Models for Black-box Optimization cs.LG · 2026-05-22 · unverdicted · none · ref 27 · internal anchor
BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets econ.GN · 2026-05-19 · unverdicted · none · ref 21 · internal anchor
Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.
Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters quant-ph · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
Nearly Optimal Attention Coresets cs.DS · 2026-05-07 · unverdicted · none · ref 27 · internal anchor
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
Efficient Training on Multiple Consumer GPUs with RoundPipe cs.DC · 2026-04-29 · conditional · none · ref 22 · internal anchor
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry cs.LG · 2026-04-03 · unverdicted · none · ref 8 · internal anchor
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking cs.LG · 2026-02-18 · unverdicted · none · ref 2 · internal anchor
Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 3 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods cs.LG · 2025-06-12 · unverdicted · none · ref 7 · internal anchor
Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States cs.LG · 2025-05-30 · unverdicted · none · ref 14 · internal anchor
Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.
Large Language Diffusion Models cs.CL · 2025-02-14 · unverdicted · none · ref 109 · internal anchor
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection cs.CL · 2024-10-06 · unverdicted · none · ref 29 · internal anchor
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States cs.LG · 2024-07-05 · conditional · none · ref 43 · internal anchor
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
KAN: Kolmogorov-Arnold Networks cs.LG · 2024-04-30 · conditional · none · ref 73 · internal anchor
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
TinyStories: How Small Can Language Models Be and Still Speak Coherent English? cs.CL · 2023-05-12 · conditional · none · ref 16 · internal anchor
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 118 · internal anchor
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Discovering Language Model Behaviors with Model-Written Evaluations cs.CL · 2022-12-19 · unverdicted · none · ref 7 · internal anchor
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling cs.CL · 2020-12-31 · conditional · none · ref 112 · internal anchor
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures cs.LG · 2026-07-02 · unverdicted · none · ref 36 · internal anchor
HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.
Agentic generation of verifiable rules for deterministic, self-expanding reaction classification cs.AI · 2026-07-01 · unverdicted · none · ref 28 · internal anchor
Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.
SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models cs.LG · 2026-06-30 · unverdicted · none · ref 11 · internal anchor
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph cs.CL · 2026-06-29 · unverdicted · none · ref 86 · internal anchor
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
Smooth Scaling Laws Hide Stepwise Token Learning cs.CL · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.
Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise cs.LG · 2026-06-28 · unverdicted · none · ref 24 · internal anchor
Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.
Phase structure of the Random Language Model cond-mat.dis-nn · 2026-06-26 · unverdicted · none · ref 4 · internal anchor
The Random Language Model exhibits a hierarchy of phase transitions in the double-scaling limit ε̃_d → 0, N → ∞ at fixed x = ε̃_d log N, with symbol correlations, non-uniform marginals, and glassy freezing, yielding scaling laws consistent with large language models.
Quantum Generative Diffusion Model for Real-World Time Series cs.LG · 2026-06-25 · unverdicted · none · ref 15 · internal anchor
QDiffusion-TS is the first quantum generative diffusion model for time series, achieving ~44% lower Wasserstein distance on Apple and Amazon stock data and up to 71% better forecasting RMSE with ~1000x fewer parameters than classical diffusion.
Structure Before Collapse: Transient semantic geometry in next-token prediction cs.LG · 2026-06-25 · unverdicted · none · ref 96 · internal anchor
Semantic geometry emerges transiently early in next-token prediction training before collapsing to Neural Collapse symmetry in synthetic settings with latent semantic factors.
Sketched Linear Contrastive Learning: Approximation, Optimization, and Statistical Scaling cs.LG · 2026-06-25 · unverdicted · none · ref 2 · internal anchor
Derives explicit scaling law for risk in sketched linear contrastive learning w.r.t. sketch dimension M, sample size N, and optimization horizon under paired Gaussian and power-law assumptions.
The Inference-Compute Frontier and a Latency-Efficient Architecture for Limit Order Book Prediction cs.LG · 2026-06-24 · unverdicted · none · ref 14 · internal anchor
Empirical power-law frontier between predictive loss and structural forward work in LOB models extrapolates to held-out high-compute architectures with R²=0.941, motivating FastBiNLOB which exceeds SOTA macro-F1 at lower latency.
The Generalization Spectrum: A Chromatographic Approach to Evaluating Learning Algorithms cs.LG · 2026-06-24 · unverdicted · none · ref 23 · 2 links · internal anchor
Introduces the Generalization Spectrum evaluation framework to track per-example generalization across transfer distances in competitive programming tasks.
Revealing Training Data Exposure in Vision Language Large Models via Parameter Gradients cs.CV · 2026-06-23 · unverdicted · none · ref 38 · internal anchor
GradAudit detects training data exposure in VLLMs by analyzing gradient stability on image-text pairs and outperforms baselines on medical and general datasets.
The Pitfall of Scaling Up: Uncovering and Mitigating Popularity Bias Amplification in Scaling Transformer-based Recommenders cs.IR · 2026-06-20 · unverdicted · none · ref 38 · internal anchor
Transformer recommenders amplify popularity bias via spectral collapse when scaled; SPRINT constrains attention column-sums and feed-forward spectral norms to improve fairness and scaling behavior.
Scaling Generative Foundation Models for Chest Radiography with Rectified Flow Transformers cs.CV · 2026-06-17 · unverdicted · none · ref 25 · internal anchor
A 1.3B-parameter rectified flow transformer is the first generative foundation model for chest radiograph synthesis at billion-parameter scale, producing images indistinguishable from real ones to experts.
Recursive Scaling in Masked Diffusion Models cs.LG · 2026-06-16 · unverdicted · none · ref 22 · internal anchor
Recursive Masked Diffusion Models add recursive depth via repeated application of the same transformer to improve parameter efficiency and reduce inference steps in masked diffusion models.
Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence stat.ML · 2026-06-10 · unverdicted · none · ref 2 · internal anchor
Bayesian reduction of attention posterior on copy task predicts first-order phase transition for softmax attention and second-order followed by crossover for linear attention.
Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models cs.LG · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
A compute-aware framework using cumulative FLOPs shows alignment training has non-monotonic effects on robustness and attack costs vary up to 5x across harm categories.
Capacity, Not Format: Rethinking Structured Reasoning Failures cs.AI · 2026-06-08 · unverdicted · none · ref 4 · internal anchor
Empirical study across 4 models and 5 benchmarks finds that structured output formats degrade LLM reasoning performance primarily in capacity-limited models, with recovery via delayed formatting.
How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap cs.LG · 2026-06-07 · conditional · none · ref 30 · internal anchor
EEG denoising saturates at 3-6.5K parameters on standard benchmarks; reconstruction metrics do not predict and can degrade motor-imagery classification utility.
Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics cs.CL · 2026-06-07 · accept · none · ref 16 · internal anchor
Naive samplers beat published diffusion and flow models on gen-PPL with incoherent output, proving the metric unsound and motivating distributional evaluation suites.
Complexity-Balanced Diffusion Splitting cs.CV · 2026-06-04 · unverdicted · none · ref 16 · internal anchor
CBS partitions the diffusion timeline into segments of equal approximation burden via Dirichlet energy and trajectory acceleration monitors estimated by an auxiliary model, yielding higher synthesis quality at fixed per-step cost across SiT, JiT and UNet backbones.
Scaling Laws for Behavioral Foundation Models over User Event Sequences cs.LG · 2026-06-03 · unverdicted · none · ref 1 · internal anchor
Across 600 runs from 10^15 to 10^19 FLOPs, behavioral models show a 2% embedder is compute-optimal at all scales, training is data-heavy at low compute, and optimal negatives increase with budget until memory-limited.
Depth-Attention: Cross-Layer Value Mixing for Language Models cs.CL · 2026-06-03 · unverdicted · none · ref 5 · internal anchor
Depth-Attention mixes values from earlier layers into the current attention value by having the query attend to previous-layer keys at the same position, yielding lower perplexity and up to 2.3 points higher average accuracy than vanilla transformers on Qwen3-style models with negligible extra FLOPs.
Spectral Scaling Laws of Muon cs.LG · 2026-06-02 · unverdicted · none · ref 6 · internal anchor
Muon momentum matrices show layer-dependent power-law scaling of stabilized singular value quantiles with model size from 77M to 2.8B parameters.
Unlocking Feature Learning in Gated Delta Networks at Scale cs.LG · 2026-06-02 · unverdicted · none · ref 3 · internal anchor
Derives μP-style scaling rules for Gated Delta Networks and validates stable learning-rate transfer in language model pre-training experiments.
DPA4: Pushing the Accuracy-Cost Frontier of Interatomic Potentials with EMFA SO(2) Convolution physics.chem-ph · 2026-06-01 · unverdicted · none · ref 282 · internal anchor
DPA4 is a new SE(3)-equivariant interatomic potential with EMFA SO(2) convolution that sets new accuracy-cost records on Matbench Discovery and SPICE benchmarks using fewer parameters than prior models.
Provable Data Scaling Law for Meta Learning via Complexity Minimization stat.ML · 2026-06-01 · unverdicted · none · ref 6 · 2 links · internal anchor
A novel complexity minimization meta-learning framework provably demonstrates that few-shot adaptation error decreases as meta-training data volume increases.
The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size physics.soc-ph · 2026-05-31 · conditional · none · ref 21 · internal anchor
A derived scaling law R(N) = 1/(1 + c(N-1)N^{-β}) fits answer diversity and correctness across 44 LLM multi-agent conditions with R² > 0.99, classifying regimes by β and showing only heterogeneous teams escape hard-ceiling saturation.
LLMs Need Encoders for Semantic IDs Too cs.IR · 2026-05-29 · unverdicted · none · ref 19 · internal anchor
PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.
Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation cs.SE · 2026-05-29 · unverdicted · none · ref 43 · internal anchor
PowerCodeBench and a boundary-aware intervention raise LLM accuracy on power-system code generation by 32-56 points across ten open-weight models and four commercial APIs on a 2,000-task benchmark.
