Scaling Laws for Neural Language Models

Benjamin Chess, Jared Kaplan, Rewon Child, Sam McCandlish, Tom B Brown, Tom Henighan · 2020 · cs.LG · arXiv 2001.08361

Canonical reference. 84% of citing Pith papers cite this work as background.

832 Pith papers citing it

abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
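As a rough illustration of the power-law form the abstract describes, the sketch below fits a curve of the shape L(N) = (N_c / N)^alpha to synthetic points by linear regression in log-log space. The constants, the noise level, and the helper name `fit_power_law` are assumptions made for this example; they are not values or code taken from the paper.

```python
import numpy as np


def fit_power_law(n, loss):
    """Fit loss = (n_c / n) ** alpha by least squares in log-log space.

    Returns (alpha, n_c). Hypothetical helper for illustration only.
    """
    # log(loss) = alpha * log(n_c) - alpha * log(n), a straight line in log-log axes.
    slope, intercept = np.polyfit(np.log(n), np.log(loss), 1)
    alpha = -slope
    n_c = np.exp(intercept / alpha)
    return alpha, n_c


# Synthetic data: made-up constants, chosen only to exercise the fit.
true_alpha, true_n_c = 0.08, 1e14
n = np.logspace(6, 9, 8)  # model sizes from 1e6 to 1e9 parameters
rng = np.random.default_rng(0)
noise = np.exp(rng.normal(0.0, 0.01, n.size))  # small multiplicative noise
loss = (true_n_c / n) ** true_alpha * noise

alpha, n_c = fit_power_law(n, loss)
print(f"alpha = {alpha:.3f}, n_c = {n_c:.2e}")
```

Fitting in log space turns the power law into a straight line, so the exponent is simply the negated slope. The recovered scale constant is far more sensitive to noise than the exponent, because it depends on dividing the intercept by a small alpha.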

citation-role summary

background 121 · method 6 · dataset 3 · baseline 2 · other 2

citation-polarity summary

background 112 · unclear 8 · use method 6 · support 3 · use dataset 3 · baseline 2

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.

The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

econ.GN · 2026-05-19 · unverdicted · novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K–V Asymmetry

cs.LG · 2026-04-03 · unverdicted · novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

cs.LG · 2026-02-18 · unverdicted · novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

cs.LG · 2025-06-12 · unverdicted · novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

cs.LG · 2025-05-30 · unverdicted · novelty 8.0

Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Smooth Scaling Laws Hide Stepwise Token Learning

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.

citing papers explorer

Showing 50 of 832 citing papers.

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories cs.CL · 2026-06-17 · unverdicted · none · ref 4 · internal anchor
RegMix-D fits regression models to proxy loss trajectories to produce dynamic data mixture schedules that outperform static RegMix and DoReMi on 25B-token Pile pretraining with a 1B model.
How Well Can Your Video Model Remember? Measuring Memory-Budget Trade-offs in Long Video Understanding cs.CV · 2026-06-17 · unverdicted · none · ref 7 · internal anchor
Fits a model where logit-accuracy scales linearly in log frame budget B with distance-dependent exponent α(D) that decays log-linearly with temporal distance D, based on 155k binary predictions across ten models.
TS-Fault: Benchmarking Time Series Forecasters Against Structural Faults cs.LG · 2026-06-16 · unverdicted · none · ref 26 · internal anchor
TS-Fault benchmark finds clean-data accuracy anti-correlates with robustness to structural faults, with all catastrophic failures under mechanism-level faults and foundation models most fragile.
Variable-Width Transformers cs.CL · 2026-06-16 · conditional · none · ref 19 · internal anchor
×-shaped variable-width transformers outperform parameter-matched uniform baselines on language modeling loss with 22% fewer FLOPs and 15% smaller KV cache.
DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models cs.CV · 2026-06-15 · unverdicted · none · ref 19 · internal anchor
DriveJudge combines VLM reasoning with rule functions on a new 33,577-sample human-annotated dataset, outperforming EPDMS by 21.23 AUC on quality classification and DriveCritic by 6.5% on trajectory preference.
How Post-Training Shapes Biological Reasoning Models cs.LG · 2026-06-15 · unverdicted · none · ref 60 · internal anchor
Post-training stages reshape generalization in biological reasoning models distinctly: CPT aligns with biological language, SFT boosts ID performance but causes OOD to peak early and decline, while RL on strong SFT checkpoints can recover OOD generalization.
Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining cs.LG · 2026-06-15 · unverdicted · none · ref 1 · internal anchor
Training-time augmentations in token noise, permutation, and offset categories reduce overfitting and improve minimum validation loss in multi-epoch autoregressive pretraining on fixed corpora.
Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning cs.AI · 2026-06-11 · unverdicted · none · ref 10 · internal anchor
Humans and LLMs exhibit similar error patterns in common-sense reasoning, consistent with shared pattern-matching mechanisms rather than abstract world models.
Viral Proteins Reveal Geometry of Protein Language Models cs.LG · 2026-06-10 · unverdicted · none · ref 44 · internal anchor
Viral proteins expose a nativeness axis in pLM embeddings aligned with masked perplexity, with retained linear separability for viral signals beyond perplexity and sequence features.
DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics cs.LG · 2026-06-10 · unverdicted · none · ref 14 · internal anchor
DynamicPTQ uses new metrics of residual-stream dynamics to apply 8-bit activation precision only to quantization-sensitive layers in W4A4KV4 LLM inference, improving perplexity and QA performance over static smoothing baselines.
Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality cs.CL · 2026-06-09 · conditional · none · ref 3 · internal anchor
Web graph centrality from Common Crawl supplies an orthogonal signal for pretraining data selection that improves language model performance when central and peripheral hosts are balanced.
Drawing with Strangers: Population Scaling Drives Zero-Shot Mutual Intelligibility in Emergent Sketching cs.LG · 2026-06-09 · unverdicted · none · ref 64 · internal anchor
Scaling population size during training of emergent sketching agents increases zero-shot mutual intelligibility between independent groups by raising in-group variation and driving perceptual grounding.
A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training math.OC · 2026-06-09 · unverdicted · none · ref 10 · internal anchor
Derives mean-field Wasserstein gradient flow for cross-entropy trained causal multi-head self-attention, with finite-head approximation bounds, propagation-of-chaos, and convergence/stability results under compactness and monotonicity assumptions.
OmniGen-AR: AutoRegressive Any-to-Image Generation cs.CV · 2026-06-08 · unverdicted · none · ref 31 · internal anchor
OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB
OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality math.OC · 2026-06-07 · unverdicted · none · ref 53 · 2 links · internal anchor
OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
Chiaroscuro Attention: Spending Compute in the Dark cs.CL · 2026-06-06 · unverdicted · none · ref 6 · internal anchor
CHIAR-Former routes tokens via spectral entropy to DCT mixing or attention, yielding 35-40% FLOP savings at 400M parameters with modest perplexity increase on WikiText-103.
Explaining Data Mixing Scaling Laws cs.LG · 2026-06-06 · unverdicted · none · ref 10 · internal anchor
A framework using capacity competition and noise reduction under an overlapping-skills assumption explains multi-domain loss behaviors and extrapolates optimal mixtures to large scales from small-scale fits with fewer parameters.
Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency cs.LG · 2026-06-05 · unverdicted · none · ref 14 · internal anchor
PACI enables bubble-free asynchronous pipeline training by bounding version drift via local gradient accumulation, matching synchronous stability with higher throughput and no extra memory.
Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings cs.CL · 2026-06-05 · unverdicted · none · ref 17 · internal anchor
EmbedFilter applies a linear filter derived from the LLM unembedding matrix to suppress high-frequency token influences in text embeddings, yielding improved zero-shot performance and inherent dimensionality reduction.
Decoding Naturalistic Emotion Dynamics from the Brain: An LLM-Enhanced Regression Framework cs.LG · 2026-06-05 · unverdicted · none · ref 17 · internal anchor
A multi-target regression framework uses LLM-derived continuous sentiment profiles from narratives and dynamic functional connectivity from fMRI to track naturalistic emotional trajectories, outperforming static ROI measures.
DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling cs.AI · 2026-06-05 · unverdicted · none · ref 16 · internal anchor
DyCon dynamically controls reasoning depth in LRMs by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.
Optimized Sampling of Angle-Resolved Scatterometry Data Using End-to-End Compressed Learning Model for Nanograss Deficiency Detection eess.SP · 2026-06-05 · unverdicted · none · ref 32 · internal anchor
An end-to-end learnable latitude-based sampling layer with CNN matches full ARS image accuracy for 5-level nanograss deficiency classification using up to 99.7% fewer sampling points.
Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws cs.LG · 2026-06-05 · unverdicted · none · ref 30 · internal anchor
MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.
Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation cs.CL · 2026-06-04 · unverdicted · none · ref 31 · internal anchor
On-policy distillation from a frozen autoregressive teacher to a bidirectional student eliminates train-inference mismatch and enables data-efficient ARLM-to-DLM conversion.
Pretraining Recurrent Networks without Recurrence cs.LG · 2026-06-04 · unverdicted · none · ref 65 · internal anchor
SMT reduces RNN training to supervised learning on memory transitions (m_t, x_{t+1}) to m_{t+1} obtained from a Transformer encoder, enabling time-parallel training with O(1) gradient paths.
Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training cs.CL · 2026-06-04 · unverdicted · none · ref 1 · internal anchor
Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.
HORIZON: Recoverability-Governed Curriculum for Physical-Domain Scaling cs.RO · 2026-06-03 · unverdicted · none · ref 1 · internal anchor
HORIZON is a recoverability-governed checkpointed frontier curriculum for on-policy physical-domain scaling on quadruped locomotion that identifies three regularities: uneven widening, non-monotonic composition, and the necessity of joint on-policy interaction.
Validity Threats for Foundation Model Research cs.LG · 2026-06-03 · accept · none · ref 47 · internal anchor
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
Dual-Stream MLP is All You Need for CTR Prediction cs.IR · 2026-06-03 · unverdicted · none · ref 19 · internal anchor
DS-MLP achieves state-of-the-art CTR prediction on three benchmarks using a final vanilla MLP structure trained via knowledge distillation and two alignment strategies.
ReSGA: A Large Tail Risk Model for Learning Value-at-Risk and Expected Shortfall stat.ML · 2026-06-03 · unverdicted · none · ref 31 · internal anchor
ReSGA, a large autoencoder, outperforms prior methods on joint VaR-ES forecasting for US equities and converts the edge into economic gains via a size-enhanced momentum strategy, with gains attributed to data complexity.
Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling cs.CL · 2026-06-02 · unverdicted · none · ref 94 · internal anchor
RL-trained lightweight controller using answer statistics improves trade-offs among correctness, latency, and total samples in adaptive sampling for LLM test-time scaling.
BYORn: Bootstrap Your Own Responses to Defend Large Vision-Language Models Against Backdoor Attacks cs.LG · 2026-06-01 · unverdicted · none · ref 105 · internal anchor
BYORn defends autoregressive vision-language models against backdoor attacks in supervised fine-tuning by dynamically replacing semantically implausible poisoned responses with model-generated alternatives, improving robustness while preserving clean performance.
Do Transformers Need Three Projections? Systematic Study of QKV Variants cs.LG · 2026-06-01 · conditional · none · ref 89 · internal anchor
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
Scaling Laws for Neural-Network Quantum States cond-mat.dis-nn · 2026-06-01 · unverdicted · none · ref 1 · internal anchor
Transformer wave functions for the J1-J2 Heisenberg model exhibit size-independent power-law decay of V-score with compute, with the exponent decreasing as frustration increases.
O-POPE: High-Frequency Pipelined Outer Product based GEMM acceleration with minimal buffering overhead cs.AR · 2026-06-01 · unverdicted · none · ref 3 · internal anchor
O-POPE is an outer-product GEMM accelerator that repurposes FPU pipeline registers for buffering to reach 1 GHz in 12 nm FINFET with under 2% buffer area and 99.97% utilization.
Consistency Training while Mitigating Obfuscation via Rate Matching cs.CL · 2026-06-01 · unverdicted · none · ref 132 · internal anchor
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location, Within-Word Landing Position, and Fixation Duration in Reading cs.CL · 2026-06-01 · unverdicted · none · ref 91 · internal anchor
Eyettention II is a new dual-sequence deep-learning model that generates realistic reading scanpaths with fixation location, landing position, and duration, outperforming prior models while reproducing key psycholinguistic effects.
When Data Is Scarce: Scaling Sparse Language Models with Repeated Training cs.LG · 2026-05-31 · unverdicted · none · ref 9 · internal anchor
Sparse LLMs in data-scarce multi-epoch regimes follow a scaling law based on active parameters, unique tokens, repetition count, and sparsity level that predicts performance and delays data saturation.
Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning cs.LG · 2026-05-31 · unverdicted · none · ref 6 · internal anchor
Local MixVR achieves communication complexity scaling only with number of workers M, independent of total samples N, and outperforms Minibatch Accelerated SGD when M is smaller than order N to the 1/4.
ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks cs.LG · 2026-05-31 · unverdicted · none · ref 1 · internal anchor
ThinkSwitch uses iterative self-distillation with QLoRA and spherical weight interpolation to raise both instruct and thinking checkpoint accuracy on small AIME and PubMedQA sets using only 15 human prompts per domain.
Learning Multi-Modal Trajectory Policies for Data-Efficient Robotic Manipulation cs.RO · 2026-05-31 · unverdicted · none · ref 26 · internal anchor
MATE is a multi-modal MoE trajectory policy using a cosine router and stochastic noise to improve expert balance, reporting 4.75% higher average success rate than prior methods on LIBERO under data scarcity.
Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs cs.CR · 2026-05-30 · unverdicted · none · ref 9 · internal anchor
Non-monotonic safety alignment appears in Gemma models, with Gemma 3 at 68.7% ASR versus 45.5% in Gemma 2 and 33.9% in Gemma 4 via MAP-Elites red-teaming and cross-generational attack transfer.
Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models cs.LG · 2026-05-30 · unverdicted · none · ref 14 · internal anchor
Proves row-space criterion for finite determinacy in linear finite-field tasks, NP-completeness of minimal forcing subcontext, and anti-mirage theorem separating threshold metrics from semantic confidence via Keisler measures.
RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video cs.CV · 2026-05-29 · unverdicted · none · ref 31 · internal anchor
RayDer is a unified transformer backbone for self-supervised static-scene novel view synthesis that absorbs dynamic content as a nuisance factor and shows power-law scaling with data and compute while matching supervised methods in zero-shot settings.
The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning cs.CL · 2026-05-29 · unverdicted · none · ref 13 · internal anchor
Experiments reveal that topological cues robustly support LLM navigation planning while incorrect semantic cues derail it, with linguistic format effects varying by model size and compression.
TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering q-bio.QM · 2026-05-29 · unverdicted · none · ref 134 · internal anchor
TadA-Bench supplies a chronological million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds that evaluates models on future-round variant ranking given only earlier data.
Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention cs.LG · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet cs.AI · 2026-05-28 · unverdicted · none · ref 41 · internal anchor
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
Inferring the Size of Large Language Models From Popular Text Memorization cs.LG · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
A method infers conservative lower bounds on LLM parameter counts from next-token accuracy profiles on popular texts using pairwise tests and PCA-based scaling-law estimation.
Bilinear Coordinate Alignment for Training-Free Task-Vector Transfer cs.LG · 2026-05-27 · unverdicted · none · ref 17 · internal anchor
BiCo transfers task vectors across models differing in width, depth, and pre-training by estimating dual-space orthogonal Procrustes mappings from one forward-backward pass on a calibration set.
