super hub Canonical reference

Scaling Laws for Neural Language Models

Benjamin Chess, Jared Kaplan, Rewon Child, Sam McCandlish, Tom B Brown, Tom Henighan · 2020 · cs.LG · arXiv 2001.08361

Canonical reference. 84% of citing Pith papers cite this work as background.

841 Pith papers citing it

Background 84% of classified citations

open full Pith review browse 841 citing papers more from Benjamin Chess arXiv PDF

abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 121 method 6 dataset 3 baseline 2 other 2

citation-polarity summary

background 112 unclear 8 use method 6 support 3 use dataset 3 baseline 2

claims ledger

abstract We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are s

authors

Benjamin Chess Jared Kaplan Rewon Child Sam McCandlish Tom B Brown Tom Henighan

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.

The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

econ.GN · 2026-05-19 · unverdicted · novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

cs.LG · 2026-04-03 · unverdicted · novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

cs.LG · 2026-02-18 · unverdicted · novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

cs.LG · 2025-06-12 · unverdicted · novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

cs.LG · 2025-05-30 · unverdicted · novelty 8.0

Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Smooth Scaling Laws Hide Stepwise Token Learning

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.

citing papers explorer

Showing 50 of 841 citing papers.

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention cs.LG · 2026-05-28 · unverdicted · none · ref 20 · internal anchor
Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet cs.AI · 2026-05-28 · unverdicted · none · ref 41 · internal anchor
Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.
Inferring the Size of Large Language Models From Popular Text Memorization cs.LG · 2026-05-28 · unverdicted · none · ref 1 · internal anchor
A method infers conservative lower bounds on LLM parameter counts from next-token accuracy profiles on popular texts using pairwise tests and PCA-based scaling-law estimation.
Bilinear Coordinate Alignment for Training-Free Task-Vector Transfer cs.LG · 2026-05-27 · unverdicted · none · ref 17 · internal anchor
BiCo transfers task vectors across models differing in width, depth, and pre-training by estimating dual-space orthogonal Procrustes mappings from one forward-backward pass on a calibration set.
No Safe Dose: How Training Data Drives Unsafe Image Generation cs.CV · 2026-05-27 · unverdicted · none · ref 29 · internal anchor
Proportion of unsafe images in training data directly increases unsafe outputs in text-to-image models, independent of absolute count, with complementary risk reduction from safer text encoders.
Throughput-Optimized Networks at Scale cs.NI · 2026-05-27 · unverdicted · none · ref 49 · internal anchor
TONS uses linear optimization and heuristics to synthesize deadlock-free network topologies and routing for datacenter AI training, reporting 2.1x and 1.6x geometric mean speedups over best TPU torus variants for uniform random and all-to-all traffic in simulation.
Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain cs.DC · 2026-05-27 · unverdicted · none · ref 18 · internal anchor
Entrain reduces microbatch workload variability by up to 10.6x and improves multimodal LLM training throughput by 1.4x via static model parallelism and deferred hierarchical microbatch assignment.
MobileMoE: Scaling On-Device Mixture of Experts cs.LG · 2026-05-26 · unverdicted · none · ref 29 · internal anchor
MobileMoE introduces on-device MoE LLMs that match dense models with 2-4x fewer FLOPs and provide efficient smartphone inference.
Automating Formal Verification with Agent-Guided Tree Search cs.LO · 2026-05-26 · unverdicted · none · ref 25 · internal anchor
Agent-directed tree search improves LLM performance on Lean formal verification tasks, with context-based orchestration solving more intermediate specs at lower token cost than baseline agents.
Looped Diffusion Language Models cs.LG · 2026-05-25 · conditional · none · ref 30 · internal anchor
LoopMDM loops early-middle layers in masked diffusion models to match same-size MDM performance with up to 3.3x fewer training FLOPs and outperform on reasoning tasks by up to 8.5 points on GSM8K.
Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers cs.LG · 2026-05-25 · unverdicted · none · ref 23 · internal anchor
WaveLiT combines wavelet tokenization, linear attention, and multiscale pyramids to produce parameter-efficient neural PDE solvers that match much larger models on TheWell benchmarks.
C3P: Contrastive promoter-protein pretraining yields representations capturing bacterial gene regulation q-bio.GN · 2026-05-24 · unverdicted · none · ref 17 · internal anchor
C3P pretrains promoter representations via contrastive alignment to protein embeddings on 88M bacterial pairs, delivering multi-fold gains on regulatory tasks and zero-shot co-regulation retrieval unlike gLMs.
NITP: Next Implicit Token Prediction for LLM Pre-training cs.CL · 2026-05-24 · unverdicted · none · ref 22 · 2 links · internal anchor
NITP augments standard next-token prediction with implicit semantic prediction in representation space using shallow-layer self-supervision, reporting consistent downstream gains on 0.5B-9B models including 5.7% on MMLU-Pro for a 9B MoE.
Divide-and-Conquer Inference for Large-Scale Visual Recognition with Multimodal Large Language Models cs.CV · 2026-05-24 · unverdicted · none · ref 2 · internal anchor
DCI decomposes large-scale visual recognition into simpler subproblems with dynamic pruning to raise MLLM accuracy on datasets like ImageNet-1K and 21K.
LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws cs.LG · 2026-05-22 · unverdicted · none · ref 14 · internal anchor
The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythia and OLMo2 data.
NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference cs.AR · 2026-05-22 · unverdicted · none · ref 8 · internal anchor
NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.
Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation cs.LG · 2026-05-22 · unverdicted · none · ref 18 · internal anchor
RankElastor mitigates embedding collapse via spectrum-robust token mixing and GLU-based P-FFNs, yielding better performance and scaling on industrial recommendation datasets.
PRAXIS: Case-distilled and code-verified AI agents for biological research q-bio.QM · 2026-05-22 · unverdicted · none · ref 5 · 2 links · internal anchor
PRAXIS converts literature cases into long-term memory for AI agents to handle problem definition, method selection, workflow execution, and review in biocomputational tasks.
General-Purpose Photonic Computing Primitive for Contemporary Artificial Intelligence physics.optics · 2026-05-21 · unverdicted · none · ref 6 · internal anchor
DUET is a photonic tensor core paradigm that uses structural symmetry in VODICs to support arbitrary signed operands directly, experimentally tested on image classification, segmentation, and Transformer tasks.
A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification cs.LG · 2026-05-21 · unverdicted · none · ref 3 · internal anchor
Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.
Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws cs.LG · 2026-05-20 · unverdicted · none · ref 1 · internal anchor
The same Transformer architecture follows different spectral scaling laws under different optimizers, with Muon achieving linear hard-rank scaling on tail representations while AdamW shows weak scaling, even when perplexity is matched.
Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate cs.LG · 2026-05-20 · unverdicted · none · ref 25 · internal anchor
A framework quantifies hyperparameter transfer via scaling-law fit quality, extrapolation robustness, and loss penalty, with ablations showing that μP's advantage over standard parameterization stems from maximizing the embedding layer learning rate to avoid bottlenecks and instabilities in AdamW.
You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories cs.LG · 2026-05-20 · unverdicted · none · ref 5 · internal anchor
RELEX extrapolates LLM checkpoints from short RLVR prefixes by projecting deltas onto a rank-1 subspace and fitting a linear trend, matching full training performance at 15% of the steps.
Towards Large Model Feature Coding cs.CV · 2026-05-20 · unverdicted · none · ref 70 · internal anchor
LaMoFCBench is a new benchmark covering 4 categories and 16 scenarios that exposes misalignment between mainstream feature codecs and the heterogeneous statistics of large-model activations.
Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory cs.CL · 2026-05-20 · unverdicted · none · ref 25 · internal anchor
Memory Grafting improves language-model benchmarks by grafting offline hidden-state memory from a larger model into a recipient model using n-gram lookups and lightweight adapters, outperforming MoE and vanilla Engram baselines at 0.92B and 2.8B scales.
HRM-Text: Efficient Pretraining Beyond Scaling cs.CL · 2026-05-20 · unverdicted · none · ref 1 · internal anchor
A 1B-parameter hierarchical recurrent model pretrained on 40B instruction-response tokens achieves 60.7% MMLU and strong results on ARC-C, DROP, GSM8K, and MATH while using 100-900x fewer tokens than standard baselines.
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving cs.CV · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
A Tabular Schedule Abstraction for Communication-Aware Evaluation of Pipeline-Parallel LLM Training cs.DC · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
A new tabular abstraction for pipeline schedules shows communication can reverse rankings from bubble analysis alone, with GPipe and 1F1B runtime-equivalent but 1F1B lower in activation memory.
C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG cs.OS · 2026-05-19 · unverdicted · none · ref 28 · internal anchor
C2CServe is a request-granularity serverless LLM serving system that keeps weights in host memory and streams them via C2C to MIG instances, cutting cold-start latency up to 7.1x while preserving TTFT/TPOT under contention.
Generative Auto-Bidding with Unified Modeling and Exploration cs.AI · 2026-05-19 · unverdicted · none · ref 21 · internal anchor
GUIDE integrates a Decision Transformer for joint modeling of bidding actions and states with Q-value regularization for exploration and an IDM for safe policy fallback, outperforming baselines in simulations and real Taobao deployment with gains in GMV, clicks, cost, and ROI.
Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency cs.CL · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
Factual recall quality in LLMs follows a sigmoid scaling law in the log-linear combination of model parameter count and topic frequency in training data, explaining 60% of variance across models and up to 94% within families.
HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation cs.LG · 2026-05-18 · unverdicted · none · ref 27 · 2 links · internal anchor
HypergraphFormer trains LLMs via supervised fine-tuning to generate hypergraph textual representations for floor plans, claiming better performance than raster or vector methods on RPLAN and a new out-of-distribution dataset while enabling arbitrary boundaries and high editability.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unverdicted · none · ref 77 · 2 links · internal anchor
Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.
SNLP: Layer-Parallel Inference via Structured Newton Corrections cs.LG · 2026-05-18 · unverdicted · none · ref 20 · 2 links · internal anchor
SNLP achieves up to 2.58x wall-clock speedup on 0.5B Transformers via architecture-specific Newton corrections (IDN/HCN) that enable layer-parallel inference while preserving perplexity in milder settings.
How Do Electrocardiogram Models Scale? cs.LG · 2026-05-17 · conditional · none · ref 11 · internal anchor
Empirical scaling study of ECG models finds SSL scales robustly while ResNets show 1.3-2.5x better parameter efficiency and SSL up to 16x better data efficiency than supervised baselines on out-of-distribution tasks.
Reasoning Can Be Restored by Correcting a Few Decision Tokens cs.AI · 2026-05-16 · conditional · none · ref 11 · internal anchor
Reasoning gaps between base LLMs and LRMs concentrate on ~8% of early planning tokens; intervening with the reasoning model only at high-disagreement positions recovers performance.
NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning cs.AI · 2026-05-16 · unverdicted · none · ref 9 · internal anchor
NeuroMAS reframes multi-agent language systems as neural architectures where LLM agents learn coordination via reinforcement learning rather than predefined roles.
UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models cs.LG · 2026-05-15 · unverdicted · none · ref 78 · internal anchor
UB-SMoE balances expert utilization in heterogeneous federated SMoE fine-tuning via Dynamic Modulated Routing and Universal Pseudo-Gradient, delivering up to 45% compute reduction and 8.7x performance gains for low-resource clients over prior LoRA-rank methods.
The Scaling Laws of Skills in LLM Agent Systems cs.CL · 2026-05-15 · unverdicted · none · ref 35 · internal anchor
Empirical analysis across 15 LLMs and 1,141 skills identifies a logarithmic routing decay law and a multiplicative execution law coupled by a single fitted slope parameter b that enables targeted library optimizations improving routing accuracy and downstream task pass rates.
Protocol-Aware Tokenization and Architecture Co-Design for Wireless Packet Foundation Models cs.NI · 2026-05-14 · unverdicted · none · ref 6 · internal anchor
Protocol-aware tokenization is the primary performance driver for wireless packet foundation models, delivering 32-point accuracy gains while architecture changes yield only 2 points in a controlled 2x2 comparison.
Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning cs.CV · 2026-05-14 · unverdicted · none · ref 22 · internal anchor
CLVR framework adds closed-loop visual verification, proxy prompt reinforcement learning, and delta-space weight merge to improve complex text-to-image generation over single-step or unverified multi-step baselines.
Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning cs.LG · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
In a structured-output NW matrix task, Transformers generalize fastest at intermediate dataset sizes while larger sets can accelerate memorization in partial-competence regimes.
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection cs.AI · 2026-05-13 · unverdicted · none · ref 20 · internal anchor
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
When is Warmstarting Effective for Scaling Language Models? cs.LG · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
A 2x growth factor in model warmstarting yields reliable training speedups for language models under 20 tokens/parameter budgets, with an empirical upper bound on effective growth factors.
CHAL: Council of Hierarchical Agentic Language cs.AI · 2026-05-12 · unverdicted · none · ref 85 · internal anchor
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
Scaling Laws for Mixture Pretraining Under Data Constraints cs.LG · 2026-05-12 · unverdicted · none · ref 3 · 2 links · internal anchor
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving cs.RO · 2026-05-12 · unverdicted · none · ref 80 · 2 links · internal anchor
MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space cs.CL · 2026-05-12 · unverdicted · none · ref 133 · internal anchor
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction q-bio.GN · 2026-05-12 · unverdicted · none · ref 51 · internal anchor
Set-aggregated genome embeddings from genomic language models predict microbiome abundance profiles with improved generalization to novel genomes over classical bioinformatics methods.
SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory cs.AI · 2026-05-12 · unverdicted · none · ref 87 · internal anchor
SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and long-term agent benchmarks.

Scaling Laws for Neural Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer