Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Scaling Laws for Neural Language Models
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
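The power-law dependence described in the abstract can be illustrated with a minimal sketch. It assumes a single-variable form, loss = c * N^(-alpha), and fits it by ordinary least squares in log-log space. The function name `fit_power_law`, the constants `TRUE_ALPHA` and `TRUE_C`, and the synthetic data are illustrative assumptions, not values or code from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**(-alpha) by least squares on (log x, log y).

    Returns (alpha, c). Hypothetical helper for illustration only.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the straight line in log-log space.
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A power law y = c * x**(-alpha) has slope -alpha and intercept log(c).
    return -slope, math.exp(intercept)


# Synthetic, noise-free data (assumed values, not measurements from the paper).
TRUE_ALPHA = 0.08
TRUE_C = 10.0
model_sizes = [10 ** k for k in range(5, 10)]
losses = [TRUE_C * n ** (-TRUE_ALPHA) for n in model_sizes]

alpha, c = fit_power_law(model_sizes, losses)
print(f"fitted alpha={alpha:.4f}, c={c:.4f}")
```

On real training runs the points would be noisy and the fit only approximate; a straight line on a log-log plot is the usual visual check that a power law is a reasonable description.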
representative citing papers
BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.
Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.
Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.
Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.
Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.
Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.
Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.
The Random Language Model exhibits a hierarchy of phase transitions in the double-scaling limit ε̃_d → 0, N → ∞ at fixed x = ε̃_d log N, with symbol correlations, non-uniform marginals, and glassy freezing, yielding scaling laws consistent with large language models.
citing papers explorer
- LLMs Need Encoders for Semantic IDs Too
  PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.
- LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
  LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
- IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems
  IAT compresses each historical interaction instance into a unified embedding token via temporal-order or user-order schemes, allowing standard sequence models to learn long-range preferences with better performance and transferability.
- Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation
  Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.
- Scaling Laws for Cross-Encoder Reranking
  Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
- GenPage: Towards End-to-End Generative Homepage Construction at Netflix
  GenPage is a transformer that autoregressively generates entire structured Netflix homepages from user prompts, delivering +0.24% engagement lift and 20% latency reduction versus production baseline in online tests.
- MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
  MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
- Evaluation of Agents under Simulated AI Marketplace Dynamics
  Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
- Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation
  A jointly learned hierarchical index with cross-attention and residual quantization scales exact retrieval in foundational recommendation models, deployed at Meta with additional performance from test-time training on index nodes.
- LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations
  LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
- Beyond Dense Connectivity: Explicit Sparsity for Scalable Recommendation
  SSR uses static random filters and iterative competitive sparse mechanisms to explicitly enforce sparsity in recommendation models, outperforming dense baselines on public and billion-scale industrial datasets.
- Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation
  HaNoRec dynamically weights harder preference samples and applies Gaussian perturbations to output distributions to improve multimodal LLM performance on sequential recommendation tasks.
- OneRec-V2 Technical Report
  OneRec-V2 scales generative recommendation to 8B parameters via decoder-only design and real-world preference alignment, improving user engagement metrics in production A/B tests.
- Rec-Distill: An Industrial Distillation Pipeline for Large-Scale Recommendation Models
  Rec-Distill is an industrial distillation pipeline that transfers substantial performance from large-scale recommendation models to efficient students, reporting over 60% transferability and measurable business gains.
- Joint Model Parameter Scaling and Universal-Domain Data Integration for E-commerce Search Ranking
  UniScale couples entire-space data construction with a hierarchical fusion transformer to improve scaling behavior and deliver 1.70% purchase and 2.04% GMV lifts in large-scale e-commerce search A/B tests.
- Less LLM, More Documents: Searching for Improved RAG
  Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.
- On the Practice of Scaling Search Conversion Rate Prediction
  Empirical scaling of backbone, embeddings, and data shows largely independent additive gains, enabling a deployed model with 2.5x data and 8x compute that delivers +2.6% CVR improvement with minimal latency change.