One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling
read the original abstract
We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.
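The two improvement figures quoted in the abstract are consistent with each other: perplexity is the exponentiated cross-entropy (ppl = 2^H with H in bits per word), so a 35% perplexity reduction from the 67.6 baseline works out to roughly a 10% drop in bits. A minimal Python check, using only the numbers given in the abstract:

```python
import math

# Figures quoted in the abstract.
baseline_ppl = 67.6                       # unpruned Kneser-Ney 5-gram baseline
combined_ppl = baseline_ppl * (1 - 0.35)  # ~43.9 after the 35% perplexity reduction

# Perplexity is the exponentiated cross-entropy: ppl = 2 ** H (bits per word).
baseline_bits = math.log2(baseline_ppl)   # ~6.08 bits/word
combined_bits = math.log2(combined_ppl)   # ~5.46 bits/word

print(f"{1 - combined_bits / baseline_bits:.1%}")  # ~10.2%, matching the quoted 10%
```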
This paper has not been read by Pith yet.
Forward citations
Cited by 21 Pith papers
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
  A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
- Infinite Mask Diffusion for Few-Step Distillation
  Infinite Mask Diffusion Models use stochastic infinite-state masks to overcome the factorization error lower bound in standard masked diffusion, achieving superior few-step performance on language tasks via distillation.
- Spherical Flows for Sampling Categorical Data
  Spherical vMF flows for categorical sequences reduce the continuity equation to a scalar ODE in cosine similarity and yield posterior-weighted tangent velocities for improved ODE and PC sampling.
- Spherical Flows for Sampling Categorical Data
  Spherical vMF flows reduce the continuity equation on the sphere to a scalar ODE in cosine similarity, enabling posterior-weighted sampling of categorical sequences via cross-entropy trained posteriors.
- LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling
  LangFlow is the first continuous diffusion language model to rival discrete diffusion on perplexity and generative perplexity while exceeding autoregressive baselines on several zero-shot tasks.
- Computational Lesions in Multilingual Language Models Separate Shared and Language-specific Brain Alignment
  Lesioning a shared core in multilingual LLMs drops whole-brain fMRI encoding correlation by 60.32%, while language-specific lesions selectively weaken predictions only for the matched native language.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- Fast Transformer Decoding: One Write-Head is All You Need
  Multi-query attention shares keys and values across heads in Transformers, greatly reducing memory bandwidth for faster decoding with only minor quality loss (a minimal sketch follows this list).
- GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
  GLUE is a multi-task benchmark for general natural language understanding that includes a diagnostic test suite and finds limited gains from current multi-task learning methods over single-task training.
- Deep Learning Scaling is Predictable, Empirically
  Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
- Pointer Sentinel Mixture Models
  Pointer sentinel-LSTM mixes context copying with softmax prediction to reach 70.9 perplexity on Penn Treebank using fewer parameters than standard LSTMs.
- Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation
  Summing outputs from separately trained QLoRA PEFT modules provides strong performance for attribute-controlled text generation, often matching or exceeding single-task modules even on single-attribute tests.
- How to Train Your Latent Diffusion Language Model Jointly With the Latent Space
  Joint training of the latent space with the diffusion process produces a competitive latent diffusion language model that is faster than existing discrete and continuous diffusion baselines.
- Language Diffusion Models are Associative Memories Capable of Retrieving Unseen Data
  Uniform-based discrete diffusion models behave as associative memories that retrieve unseen data, with a dataset-size-driven memorization-to-generalization transition detectable via conditional entropy of token predictions.
- Interpolating Discrete Diffusion Models with Controllable Resampling
  IDDM interpolates diffusion transitions with a resampling mechanism to lessen dependence on intermediate latents and improve sample quality over masked and uniform discrete diffusion models.
- Generative Frontiers: Why Evaluation Matters for Diffusion Language Models
  Generative perplexity and entropy are shown to be the two additive components of KL divergence to a reference distribution, motivating generative frontiers as a principled evaluation method for diffusion language models.
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
  Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- A General Language Assistant as a Laboratory for Alignment
  Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
- A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
  Re-evaluating controlled text generation systems under standardized conditions reveals that many published performance claims do not hold, highlighting the need for consistent evaluation practices.
- FastOmniTMAE: Parallel Clause Learning for Scalable and Hardware-Efficient Tsetlin Embeddings
  FastOmniTMAE parallelizes clause learning in Tsetlin Machine autoencoders to achieve up to 5x faster training with comparable embedding quality and low-footprint FPGA deployment.
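The "Fast Transformer Decoding" entry above describes multi-query attention, where every query head attends over a single shared key/value head, so the decode-time KV cache shrinks by a factor of the head count. A minimal NumPy sketch of that sharing; all shapes and weight names are chosen purely for illustration, and causal masking and the output projection are omitted:

```python
import numpy as np

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Toy multi-query attention: per-head queries, one shared key/value head."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    q = (x @ w_q).reshape(seq_len, n_heads, d_head)  # per-head queries
    k = x @ w_k                                      # shared keys   (seq_len, d_head)
    v = x @ w_v                                      # shared values (seq_len, d_head)

    heads = []
    for h in range(n_heads):
        scores = q[:, h, :] @ k.T / np.sqrt(d_head)            # (seq_len, seq_len)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
        heads.append(weights @ v)                               # (seq_len, d_head)
    return np.concatenate(heads, axis=-1)                       # (seq_len, d_model)

# Illustrative shapes: 8 query heads share one key/value projection, so the
# cached K and V are n_heads times smaller than in standard multi-head attention.
d_model, n_heads = 64, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((10, d_model))
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model // n_heads))
w_v = rng.standard_normal((d_model, d_model // n_heads))
print(multi_query_attention(x, w_q, w_k, w_v, n_heads).shape)  # (10, 64)
```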