Scaling Laws for Neural Language Models

Benjamin Chess, Jared Kaplan, Rewon Child, Sam McCandlish, Tom B Brown, Tom Henighan · 2020 · cs.LG · arXiv 2001.08361

Canonical reference. 83% of citing Pith papers cite this work as background.

798 Pith papers citing it

abstract

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude. Other architectural details such as network width or depth have minimal effects within a wide range. Simple equations govern the dependence of overfitting on model/dataset size and the dependence of training speed on model size. These relationships allow us to determine the optimal allocation of a fixed compute budget. Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
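The power-law claim in the abstract can be illustrated with a minimal sketch. A relation of the form L(N) = a · N^(-alpha) becomes a straight line in log-log space, so the exponent can be recovered with an ordinary least-squares fit. The constants and data points below are hypothetical and synthetic, chosen only for illustration; they are not the paper's fitted values, and `fit_power_law` is an illustrative helper, not code from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-alpha) by least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line in log-log space.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - alpha * log x, so alpha is the negated slope.
    return math.exp(intercept), -slope


# Hypothetical constants (assumptions for this sketch, not reported results).
TRUE_A = 10.0
TRUE_ALPHA = 0.08

# Synthetic, noise-free "loss" values at model sizes from 1e5 to 1e9 parameters.
model_sizes = [10.0 ** k for k in range(5, 10)]
losses = [TRUE_A * n ** (-TRUE_ALPHA) for n in model_sizes]

a_hat, alpha_hat = fit_power_law(model_sizes, losses)
print(f"recovered a = {a_hat:.4f}, alpha = {alpha_hat:.4f}")
```

With real measurements the points would scatter around the line, and the quality of the straight-line fit over several orders of magnitude is what supports calling the trend a power law.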

citation-role summary

background 119 · method 6 · dataset 3 · baseline 2 · other 2

citation-polarity summary

background 110 · unclear 8 · use method 6 · support 3 · use dataset 3 · baseline 2

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

cs.LG · 2026-05-22 · unverdicted · novelty 8.0

BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.

The Economics of Model Collapse: Equilibrium, Welfare, and Optimal Provenance Subsidies in Synthetic Data Markets

econ.GN · 2026-05-19 · unverdicted · novelty 8.0

Introduces the Synthetic Data Contamination Equilibrium and derives closed-form optimal provenance subsidies s* = KL(q||p)/(2 kappa) plus watermark strengths to mitigate model collapse, validated by OLS matching structural predictions on C4 data.

Quantum-enhanced Large Language Models on Quantum Hardware via Cayley Unitary Adapters

quant-ph · 2026-05-07 · unverdicted · novelty 8.0

Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.

Nearly Optimal Attention Coresets

cs.DS · 2026-05-07 · unverdicted · novelty 8.0

ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.

Efficient Training on Multiple Consumer GPUs with RoundPipe

cs.DC · 2026-04-29 · conditional · novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on 8x RTX 4090.

The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

cs.LG · 2026-04-03 · unverdicted · novelty 8.0

Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicting layer importance and enabling pruning gains of 1.1x-3.6x over Last-N baselines.

Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

cs.LG · 2026-02-18 · unverdicted · novelty 8.0

Grokking reflects escape from a metastable low-dimensional regime where transverse curvature accumulates before generalization, with subspace motion necessary but curvature boost insufficient.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

cs.LG · 2025-06-12 · unverdicted · novelty 8.0

Transformers perform kernel-based prediction for Hölder regression on manifolds and achieve intrinsic-dimension-dependent minimax rates with sufficient training tasks.

Privacy Amplification in Differentially Private Zeroth-Order Optimization with Hidden States

cs.LG · 2025-05-30 · unverdicted · novelty 8.0

Introduces hybrid noise and novel coupling analysis to achieve the first convergent hidden-state DP bound for zeroth-order optimization.

Large Language Diffusion Models

cs.CL · 2025-02-14 · unverdicted · novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

cs.CL · 2024-10-06 · unverdicted · novelty 8.0

ErrorRadar is a new benchmark of 2,500 multimodal K-12 math problems for MLLM error step identification and categorization, where GPT-4o trails human experts by ~10%.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

cs.CL · 2023-05-12 · conditional · novelty 8.0

Tiny language models under 10M parameters trained on a synthetic children's story dataset generate fluent, consistent, multi-paragraph English text with near-perfect grammar and reasoning.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Discovering Language Model Behaviors with Model-Written Evaluations

cs.CL · 2022-12-19 · unverdicted · novelty 8.0

Language models can automatically generate high-quality evaluation datasets that reveal new cases of inverse scaling, sycophancy, and concerning goal-seeking behaviors, including some worsened by RLHF.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

Agentic generation of verifiable rules for deterministic, self-expanding reaction classification

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Multi-agent LLMs generate and verify 14,073 deterministic reaction rules from 665,901 patents, enabling 97.7% classification of unseen reactions with finer resolution than fixed proprietary systems.

SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

SemRF supplies fixed semantic anchors and pseudo-inverse tying to produce stable coordinates for residual dynamics, Voronoi traces, and minimum-action canonical paths that link to parameter efficiency under controlled interface error.

CORTEX: High-Quality Cross-Domain Organization of Web-Scale Corpora through Ontological Corpus Graph

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Cortex uses an Ontological Corpus Graph to structure web-scale corpora, creating a refined 24.14B-token corpus and a new benchmark validated on eight LLMs.

Smooth Scaling Laws Hide Stepwise Token Learning

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

Token loss trajectories follow localized sigmoids whose learning-time spectrum quantitatively reconstructs scaling-law derivatives on T, D, and M axes and enables faster training via distribution reshaping.

Optimizer Memory Makes Shuffle Order a First-Order Source of Fine-Tuning Noise

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Fixed-clock optimizer memory turns equal-multiset data shuffle order into an O(η) source of fine-tuning noise, larger than the O(η²) effect in memoryless cases, with a fit-free sizing method derived.

citing papers explorer

Showing 50 of 798 citing papers.

Quantifying Prior Dominance in RAG Systems cs.CL · 2026-04-29 · unverdicted · none · ref 5 · internal anchor
Introduces NCU metric using token log-probabilities and finds small language models match or outperform larger ones in strict factual RAG extraction, while commercial APIs show high prior dominance and negative transfer.
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate cs.MA · 2026-04-29 · unverdicted · none · ref 8 · internal anchor
Homogeneous multi-agent debate introduces sycophantic conformity, contextual fragility, and consensus collapse, leading to equal or lower accuracy than isolated self-correction at 2.1-3.4x higher token cost on GSM-Hard and MMLU-Hard.
An Empirical Study of Speculative Decoding on Software Engineering Tasks cs.SE · 2026-04-29 · unverdicted · none · ref 30 · internal anchor
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
Optimizing ground state preparation protocols with autoresearch quant-ph · 2026-04-28 · unverdicted · none · ref 93 · 2 links · internal anchor
AI coding agents evolve simple ground-state protocols into improved versions for VQE, DMRG, and AFQMC on spin models and molecules by using executable energy scores under fixed compute budgets.
Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity cs.LG · 2026-04-27 · unverdicted · none · ref 10 · internal anchor
Incompressible Knowledge Probes enable log-linear estimation of LLM parameter counts from factual accuracy on obscure questions, showing continued scaling of knowledge capacity across open and closed models.
Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective cs.CL · 2026-04-25 · conditional · none · ref 34 · 2 links · internal anchor
A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models cs.LG · 2026-04-22 · unverdicted · none · ref 16 · internal anchor
A fitted iso-depth scaling law measures that one recurrence in looped transformers is worth r^0.46 unique blocks in validation loss.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling cs.LG · 2026-04-22 · unverdicted · none · ref 1 · internal anchor
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction cs.IR · 2026-04-21 · unverdicted · none · ref 10 · internal anchor
LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms cs.CL · 2026-04-21 · unverdicted · none · ref 11 · internal anchor
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
Rethinking Dataset Distillation: Hard Truths about Soft Labels cs.LG · 2026-04-20 · conditional · none · ref 18 · internal anchor
Soft labels hide the value of high-quality data subsets in dataset distillation, and a new compute-aware method outperforms existing approaches in hard-label settings on ImageNet-1K.
Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion cs.CL · 2026-04-20 · unverdicted · none · ref 6 · internal anchor
TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.
Causal inference for social network formation econ.EM · 2026-04-20 · conditional · none · ref 84 · internal anchor
Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys cs.AI · 2026-04-19 · unverdicted · none · ref 24 · internal anchor
A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
Understanding and Enforcing Weight Disentanglement in Task Arithmetic cs.AI · 2026-04-18 · unverdicted · none · ref 20 · internal anchor
Task-Feature Specialization explains weight disentanglement in task arithmetic and leads to orthogonality, which OrthoReg enforces to enhance performance of model composition methods.
Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization cs.CR · 2026-04-16 · unverdicted · none · ref 22 · internal anchor
R²A uses a hybrid ensemble surrogate router and suffix optimization to significantly increase black-box LLM router selection of expensive models across query distributions.
DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines cs.CV · 2026-04-15 · unverdicted · none · ref 21 · internal anchor
DharmaOCR models reach 0.925 and 0.911 extraction scores with 0.40% and 0.20% degeneration rates on a new benchmark covering printed, handwritten, and legal documents, outperforming open-source and commercial baselines via SFT plus DPO.
Presynthesis: Towards Scaling Up Program Synthesis with Finer-Grained Abstract Semantics cs.PL · 2026-04-14 · unverdicted · none · ref 26 · internal anchor
Presynthesis constructs a tree automaton and oracle offline to allow efficient use of fine-grained abstract semantics for pruning in search-based program synthesis.
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size cs.CL · 2026-04-14 · unverdicted · none · ref 10 · internal anchor
Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
Robust Explanations for User Trust in Enterprise NLP Systems cs.CL · 2026-04-13 · unverdicted · none · ref 2 · internal anchor
Decoder LLMs produce substantially more stable explanations than encoder models, with 73% lower top-token flip rates on average and further 44% gains from 7B to 70B scale.
STORM: End-to-End Referring Multi-Object Tracking in Videos cs.CV · 2026-04-12 · unverdicted · none · ref 30 · internal anchor
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
AI Achieves a Perfect LSAT Score cs.AI · 2026-04-11 · unverdicted · none · ref 2 · internal anchor
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems cs.IR · 2026-04-10 · unverdicted · none · ref 18 · internal anchor
IAT compresses each historical interaction instance into a unified embedding token via temporal-order or user-order schemes, allowing standard sequence models to learn long-range preferences with better performance and transferability.
Exponential quantum advantage in processing massive classical data quant-ph · 2026-04-08 · unverdicted · none · ref 42 · internal anchor
A polylog-sized quantum computer achieves exponential advantage over classical machines in classification and dimension reduction of massive classical data using quantum oracle sketching combined with classical shadows.
The Shrinking Lifespan of LLMs in Science cs.DL · 2026-04-08 · unverdicted · none · ref 10 · internal anchor
LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.
Discrete Prototypical Memories for Federated Time Series Foundation Models cs.LG · 2026-04-06 · unverdicted · none · ref 12 · internal anchor
FeDPM learns and aligns local discrete prototypical memories across domains to create a unified discrete latent space for LLM-based time series foundation models in a federated setting.
Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation cs.IR · 2026-04-04 · accept · none · ref 26 · internal anchor
Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving cs.DC · 2026-03-26 · unverdicted · none · ref 12 · internal anchor
GhostServe applies erasure coding to KV cache in host memory for fast recovery from failures in LLM serving, cutting checkpointing latency up to 2.7x and recovery latency 2.1x versus prior methods.
SeaAlert: Critical Information Extraction From Maritime Distress Communications with Large Language Models cs.CL · 2026-03-23 · unverdicted · none · ref 33 · internal anchor
SeaAlert generates synthetic noisy maritime distress transcripts via LLM and ASR simulation to train robust extraction of critical information from real VHF communications.
Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing cs.CL · 2026-03-20 · conditional · none · ref 22 · internal anchor
Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.
Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models cs.CL · 2026-03-13 · unverdicted · none · ref 13 · internal anchor
The paper introduces the Turkic Transfer Coefficient (TTC) as a theoretical measure of transfer potential and a scaling model linking adaptation performance to model capacity, data size, and adaptation module expressivity in Turkic languages.
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes cs.CV · 2026-03-10 · unverdicted · none · ref 18 · internal anchor
Panorama-Language Models with a sparse attention module and PanoVQA dataset deliver superior holistic reasoning on 360° adverse omni-scenes compared to stitched pinhole views.
Scaling Laws for Cross-Encoder Reranking cs.IR · 2026-03-05 · unverdicted · none · ref 23 · internal anchor
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure cs.LG · 2026-02-19 · unverdicted · none · ref 3 · internal anchor
Multi-task grokking in Transformers produces staggered generalization, low-dimensional manifolds, weight-decay phase structure, holographic solutions, and transverse redundancy.
Theory of Optimal Learning Rate Schedules and Scaling Laws for a Random Feature Model cond-mat.dis-nn · 2026-02-04 · unverdicted · none · ref 10 · internal anchor
In a random feature model, optimal SGD learning-rate schedules are polynomial decay in the easy phase and warmup-stable-decay in the hard phase, outperforming constant or simple power-law schedules and transferring differently across training horizons.
Imposing Boundary Conditions on Neural Operators via Learned Function Extensions cs.LG · 2026-02-04 · unverdicted · none · ref 14 · internal anchor
A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.
Mixture-of-Top-k Attention: Efficient Attention via Scalable Fast Weights cs.LG · 2026-02-01 · unverdicted · none · ref 7 · internal anchor
MiTA makes attention scalable by gathering query-aware top-k key-value pairs through landmarks as deformable routed experts and compressing the N-width fast-weight MLP into a shared narrower expert.
Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads cs.LG · 2026-01-29 · unverdicted · none · ref 4 · internal anchor
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
OmniMol: Transferring Particle Physics Knowledge to Molecular Dynamics with Point-Edge Transformers physics.chem-ph · 2026-01-15 · unverdicted · none · ref 47 · internal anchor
OmniMol transfers a billion-jet pre-trained PET foundation model from HEP to molecular dynamics via an interaction-matrix attention bias, delivering strong performance on the oMol dataset with minimal fine-tuning and fast inference.
TS-Arena -- A Live Forecast Pre-Registration Platform cs.LG · 2025-12-23 · conditional · none · ref 18 · internal anchor
TS-Arena is a live pre-registration platform that evaluates time series forecasts on future data streams to eliminate information leakage.
Shrinkage to Infinity: Reducing Test Error by Inflating the Minimum Norm Interpolator in Linear Models math.ST · 2025-10-22 · unverdicted · none · ref 5 · internal anchor
Inflating the min-norm interpolator by a factor >1 reduces generalization error in linear regression with anisotropic covariances when d/n diverges to infinity.
The Art of Scaling Reinforcement Learning Compute for LLMs cs.LG · 2025-10-15 · unverdicted · none · ref 8 · internal anchor
A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.
Scaling Vision Transformers for Functional MRI with Flat Maps cs.CV · 2025-10-15 · conditional · none · ref 54 · internal anchor
CortexMAE adapts Vision Transformers to fMRI via cortical flat maps, shows power-law scaling on 2.1K hours of data, and outperforms priors on cognitive state decoding while failing to beat a simple functional connectivity baseline on subject-level trait prediction.
Less is More: Recursive Reasoning with Tiny Networks cs.LG · 2025-10-06 · unverdicted · none · ref 8 · internal anchor
TRM with 7M parameters achieves 45% accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, surpassing most LLMs with under 0.01% of their parameters.
SynBench: A Benchmark for Differentially Private Text Generation cs.AI · 2025-09-18 · conditional · none · ref 26 · internal anchor
SynBench benchmarks DP text generators across nine datasets and uses a new MIA to show that public pre-training on portions of private data overestimates synthetic text quality and breaks DP privacy bounds.
From Membership-Privacy Leakage to Quantum Machine Unlearning quant-ph · 2025-09-07 · unverdicted · none · ref 31 · internal anchor
Quantum neural networks exhibit membership privacy leakage that a proposed quantum machine unlearning framework with three mechanisms can mitigate in simulations and cloud device tests.
Pre-trained Large Language Models Learn Hidden Markov Models In-context cs.LG · 2025-06-08 · unverdicted · none · ref 25 · internal anchor
Pre-trained LLMs learn to predict HMM-generated sequences via in-context learning, approaching theoretical optimum on synthetic HMMs and matching expert models on real animal decision data.
High-Dimensional Private Linear Regression with Optimal Rates stat.ML · 2025-05-22 · accept · none · ref 16 · internal anchor
DP-GD achieves minimax optimal non-asymptotic risk O(γ + γ²/ρ²) for well-conditioned high-dimensional data and power-law scaling for ill-conditioned power-law spectra, with the exponent depending on the privacy parameter ρ.
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference cs.DC · 2025-05-16 · conditional · none · ref 5 · internal anchor
TokenWeave achieves up to 1.28x lower latency and 1.19x higher throughput for distributed LLM inference by enabling compute-communication overlap at small token counts via a fused AllReduce-RMSNorm kernel that uses only 2-8 SMs.
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models cs.CL · 2025-04-29 · unverdicted · none · ref 35 · internal anchor
The authors generate and publicly release the first large-scale open dataset of three million structured moral fables produced by small open language models together with a reproducible LLM-judge evaluation pipeline.