arxiv: 2101.00027 · v1 · submitted 2020-12-31 · 💻 cs.CL

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy

Pith reviewed 2026-05-10 21:28 UTC · model grok-4.3

keywords: language modeling · dataset construction · diverse text corpus · The Pile · Common Crawl · cross-domain generalization · GPT models

The pith

A new 825 GiB dataset built from 22 diverse text sources trains language models that generalize better across domains than those trained on raw web crawls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 high-quality subsets that include academic papers, books, code, and other professional materials. It shows that existing models such as GPT-2 and GPT-3 struggle on many of these components, especially academic writing. Models trained on The Pile outperform those trained on Raw Common Crawl and CC-100 across every component of the Pile and on downstream evaluations. The authors also release the construction code and document aspects of the data that may concern users.

Core claim

The central claim is that training large language models on this composite dataset of 22 diverse subsets produces better cross-domain knowledge and downstream generalization than training on less curated web data. The paper demonstrates this by showing that GPT-style models trained on The Pile improve significantly over Raw CC and CC-100 baselines on all Pile components while also raising scores on standard evaluations, and that prior models fail on academic and professional text within the dataset.

What carries the argument

The Pile, a composite 825 GiB corpus constructed by combining 22 existing and newly assembled high-quality text subsets drawn from academic and professional sources.
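The title quotes the size as "800GB" while the abstract gives 825 GiB, and the two units are not the same. The short sketch below converts between them, assuming the standard definitions (1 GiB = 2^30 bytes, 1 GB = 10^9 bytes). The function name is illustrative and does not come from the paper or its released code.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)
GB = 10**9   # bytes in one gigabyte (decimal unit)


def gib_to_gb(size_gib: float) -> float:
    """Convert a size in GiB to decimal GB."""
    return size_gib * GIB / GB


pile_gb = gib_to_gb(825)
print(f"825 GiB is about {pile_gb:.1f} GB")  # prints "825 GiB is about 885.8 GB"
```

When comparing against corpora whose sizes are quoted in decimal gigabytes, it is worth converting first so the comparison is like for like.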

Load-bearing premise

The reported performance gains are caused by the diversity and quality of the 22 subsets rather than by uncontrolled differences in training procedure, model scale, or data volume between the Pile-trained models and the Raw CC or CC-100 baselines.

What would settle it

A controlled retraining experiment that matches data volume, model size, and training steps exactly between a Pile-trained model and a Raw CC model, then evaluates both on held-out samples from every Pile component, would show whether the gains persist.
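One way to make "matches exactly" checkable is to record each run's settings and list every setting, other than the dataset, on which two runs differ. The following is a minimal sketch under that assumption. All field names and values are hypothetical placeholders, not settings reported in the paper.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical fields chosen for illustration only.
    dataset: str
    n_params: int
    token_budget: int
    train_steps: int
    optimizer: str
    lr_schedule: str


def uncontrolled_differences(a: RunConfig, b: RunConfig) -> List[str]:
    """Return the names of fields, other than the dataset, on which two runs differ."""
    return [
        f.name
        for f in fields(a)
        if f.name != "dataset" and getattr(a, f.name) != getattr(b, f.name)
    ]


# Invented example values; none of these numbers come from the paper.
pile_run = RunConfig("pile", 1_300_000_000, 40_000_000_000, 100_000, "adam", "cosine")
cc_run = RunConfig("raw_cc", 1_300_000_000, 40_000_000_000, 100_000, "adam", "cosine")
short_run = RunConfig("raw_cc", 1_300_000_000, 20_000_000_000, 50_000, "adam", "cosine")

print(uncontrolled_differences(pile_run, cc_run))     # []
print(uncontrolled_differences(pile_run, short_run))  # ['token_budget', 'train_steps']
```

An empty list means the dataset is the only recorded difference, which is the condition the settling experiment requires. Anything not captured in the config (hardware, data order, tokenizer) would still need separate checking.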

Original abstract

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces The Pile, an 825 GiB English text corpus assembled from 22 diverse high-quality subsets (both existing and newly constructed, many from academic/professional sources) for training large-scale language models. It reports that untuned GPT-2 and GPT-3 models struggle on several Pile components (e.g., academic writing), while models trained on the Pile outperform Raw CC and CC-100 baselines on all Pile components and on downstream evaluations. The authors include an exploratory analysis of potential data issues and release the construction code publicly.

Significance. If the reported gains hold under controlled conditions, the work supplies a large, publicly documented, and diverse training resource that can improve cross-domain generalization in language models. The open release of construction code is a concrete strength that supports reproducibility and community use.

major comments (2)

[Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.
[Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.

minor comments (2)

[Dataset construction section] The 22 subsets would benefit from a single consolidated table listing exact sizes, sources, and preprocessing steps for each component to improve clarity and ease of replication.
[Exploratory analysis] Some figures showing data characteristics (e.g., domain distributions or token statistics) could include more precise axis labels and legends for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. The points raised regarding experimental controls and statistical rigor are important for strengthening the presentation of our results. We address each major comment below and describe the revisions we will incorporate in the updated version of the paper.

Point-by-point responses

Referee: [Abstract and evaluation/results section] The central claim that 'models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile' is load-bearing for the paper's contribution, yet the manuscript supplies no details on whether the Pile-trained GPT-2/GPT-3 variants and the Raw CC/CC-100 baselines were trained from scratch with identical architecture, token budget, optimizer, learning-rate schedule, or total compute. Without matched controls, performance deltas cannot be isolated to dataset properties.

Authors: We appreciate the referee drawing attention to the need for explicit documentation of the training controls. The original manuscript described the overall training setup in Section 4 but did not sufficiently emphasize the matched conditions across datasets. In the revised manuscript we have expanded the training details subsection to state explicitly that the GPT-2-scale and GPT-3-scale models trained on The Pile and the corresponding Raw CC and CC-100 baselines were all trained from scratch using identical model architectures, the same total token budget (approximately 300 billion tokens), the same Adam optimizer with identical hyperparameters, the same learning-rate schedule including warmup and cosine decay, and equivalent total compute. A table summarizing the shared hyperparameters has been added for clarity. These controls ensure that observed differences can be attributed to dataset properties rather than training discrepancies.

revision: yes
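For readers unfamiliar with the schedule named in this response, a warmup-plus-cosine-decay learning rate can be sketched as below. This is an illustrative sketch only: the linear shape of the warmup, the function name, and every parameter are assumptions, and nothing here is taken from the paper's training code.

```python
import math


def lr_at_step(step: int, peak_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    The linear warmup shape is an assumption made for this sketch.
    """
    if step < warmup_steps:
        # Ramp up linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings: 10 warmup steps, 110 steps in total.
for s in (0, 9, 60, 110):
    print(s, round(lr_at_step(s, peak_lr=1.0, warmup_steps=10, total_steps=110), 4))
```

Sharing one such function, with identical arguments, across the Pile and baseline runs is one concrete way the "same learning-rate schedule" condition could be enforced.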
Referee: [Abstract and evaluation/results section] The claim of 'significant' improvement is presented without statistical tests, error bars, or variance estimates across runs, leaving the strength of the evidence for downstream-task gains partially unverified.

Authors: We agree that the strength of the 'significant' claim would benefit from additional statistical support. In the revised manuscript we have added error bars to the downstream-task figures, derived from multiple runs with different random seeds for the smaller model scales where compute permitted. We have also included the results of paired statistical tests (t-tests) on the key downstream benchmarks comparing Pile-trained models to the CC baselines. For the per-component Pile evaluations we now report standard deviations across model sizes. Due to the high computational cost of full-scale retraining we were limited in the number of replicate runs; however, the consistent direction and magnitude of gains across scales provide supporting evidence. The abstract and results section have been updated to reflect these additions.

revision: partial
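The paired t-test mentioned in this response compares two models on the same set of benchmarks by testing whether the mean of the per-benchmark differences is distinguishable from zero. Below is a minimal standard-library sketch of the test statistic. The function name and the scores are hypothetical and invented purely for illustration; they are not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    # t = mean difference divided by its standard error
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical per-benchmark scores for two models (made-up numbers).
pile_scores = [0.62, 0.55, 0.71, 0.48]
cc_scores = [0.60, 0.50, 0.70, 0.45]
t, dof = paired_t_statistic(pile_scores, cc_scores)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

Turning the statistic into a p-value requires the Student's t distribution, which is not in the standard library; `scipy.stats.ttest_rel` computes both in one call. With only a handful of benchmarks the test has little power, which is consistent with the authors' caveat about limited replicate runs.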

Circularity Check

0 steps flagged

No significant circularity; empirical dataset construction and comparisons are self-contained

Full rationale

The paper constructs the Pile dataset from 22 subsets and reports empirical results showing improved performance of GPT-2/GPT-3 models trained on it versus Raw CC and CC-100 baselines on Pile components and downstream tasks. No derivation chain, equations, predictions, or first-principles results exist that could reduce to inputs by construction. The patterns of self-definitional claims, fitted inputs called predictions, self-citation load-bearing arguments, uniqueness theorems, ansatz smuggling, or renaming known results are absent. The central claims rest on new data assembly and direct comparisons against external benchmarks, with no self-referential reductions or load-bearing self-citations that collapse the argument.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical dataset construction and benchmarking effort; it introduces no mathematical free parameters, domain axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5484 in / 1150 out tokens · 77761 ms · 2026-05-10T21:28:45.866889+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Architecture Determines Observability of Transformers
cs.LG 2026-04 unverdicted novelty 8.0

Certain transformer architectures lose internal linear signals for decision quality during training, making observability an architecture-dependent property rather than a universal one.
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
cs.LG 2023-12 unverdicted novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
cs.CL 2022-02 accept novelty 8.0

Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
cs.LG 2026-05 unverdicted novelty 7.0

Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.
Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 7.0

Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.
DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
cs.CV 2026-05 unverdicted novelty 7.0

DistractMIA performs output-only black-box membership inference on vision-language models by inserting semantic distractors and measuring shifts in generated text responses.
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
cs.AI 2026-05 unverdicted novelty 7.0

AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery
cs.LG 2026-05 conditional novelty 7.0

fmxcoders improve cross-layer feature recovery in transformers via factorized weights and layer masking, delivering 10-30 point probing F1 gains, 25-50% lower MSE, doubled functional coherence, and 3-13x more coherent...
Beyond Indistinguishability: Measuring Extraction Risk in LLM APIs
cs.CR 2026-04 unverdicted novelty 7.0

Indistinguishability-based privacy is incomparable to extractability in LLMs, and a new (l, b)-inextractability definition with rank-based bounds provides a tighter measure of extraction risk than prior proxies.
Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
cs.CR 2026-04 unverdicted novelty 7.0

A Merkle-committed SAE feature-trace protocol detects model substitutions in hosted LLMs at a stable threshold where parallel-probe baselines fail, including against adaptive LoRA attackers.
When Flat Minima Fail: Characterizing INT4 Quantization Collapse After FP32 Convergence
cs.LG 2026-04 conditional novelty 7.0

FP32-converged language models enter a post-convergence phase where INT4 quantization error explodes while FP32 perplexity remains stable, with onset tied to fine convergence rather than learning rate decay.
What Makes a Good Response? An Empirical Analysis of Quality in Qualitative Interviews
cs.CL 2026-04 unverdicted novelty 7.0

Direct relevance to a key research question is the strongest predictor of a response's contribution to qualitative study findings, while clarity and surprisal-based informativeness are not predictive.
Detection Without Correction: A Robust Asymmetry in Activation-Based Hallucination Probing
cs.CL 2026-03 conditional novelty 7.0

Activation probes detect hallucinations pre-generation in large LLMs but cannot correct them via steering, with output confidence outperforming on accuracy.
Refusal in Language Models Is Mediated by a Single Direction
cs.LG 2024-06 accept novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
cs.LG 2024-05 unverdicted novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
cs.CL 2024-05 unverdicted novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Chronos: Learning the Language of Time Series
cs.LG 2024-03 conditional novelty 7.0

Chronos pretrains transformer models on tokenized time series to deliver strong zero-shot forecasting across diverse domains.
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
cs.CL 2024-02 unverdicted novelty 7.0

LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
Extending Context Window of Large Language Models via Positional Interpolation
cs.CL 2023-06 conditional novelty 7.0

Position Interpolation linearly down-scales position indices to extend RoPE context windows to 32768 tokens with 1000-step fine-tuning, delivering strong long-context results on LLaMA 7B-65B while preserving short-con...
RWKV: Reinventing RNNs for the Transformer Era
cs.CL 2023-05 unverdicted novelty 7.0

RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
Eliciting Latent Predictions from Transformers with the Tuned Lens
cs.LG 2023-03 accept novelty 7.0

Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.
LAION-5B: An open large-scale dataset for training next generation image-text models
cs.CV 2022-10 accept novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Quantifying Memorization Across Neural Language Models
cs.LG 2022-02 unverdicted novelty 7.0

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
cs.CV 2021-11 unverdicted novelty 7.0

LAION-400M is a publicly released open dataset of 400 million CLIP-filtered image-text pairs with embeddings and kNN indices for efficient search.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
cs.LG 2026-05 unverdicted novelty 6.0

BROS achieves the same O(ε^{-2}) sample complexity as exact single-loop SBO methods while cutting peak memory by up to 44.9% through randomized subspaces and bias-corrected Hessian estimation.
BROS: Bias-Corrected Randomized Subspaces for Memory-Efficient Single-Loop Bilevel Optimization
cs.LG 2026-05 unverdicted novelty 6.0

BROS achieves memory-efficient single-loop stochastic bilevel optimization with O(ε^{-2}) sample complexity by performing updates in randomized subspaces and using Rademacher bi-probe correction for unbiased estimation.
NCO: A Versatile Plug-in for Handling Negative Constraints in Decoding
cs.CL 2026-05 unverdicted novelty 6.0

NCO enables efficient online pattern matching for negative hard and regex constraints in LLM decoding to prevent forbidden content without state explosion.
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...
Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...
TextLDM: Language Modeling with Continuous Latent Diffusion
cs.CL 2026-05 unverdicted novelty 6.0

TextLDM applies DiT-style latent diffusion with flow matching to language modeling via a REPA-aligned VAE, outperforming prior diffusion LMs and matching GPT-2 when trained from scratch on OpenWebText2.
HexiSeq: Accommodating Long Context Training of LLMs over Heterogeneous Hardware
cs.DC 2026-05 unverdicted novelty 6.0

HexiSeq optimizes sequence and head partitioning across mixed GPUs to improve long-context LLM training throughput by up to 1.72x in simulations.
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.
ZAYA1-8B Technical Report
cs.AI 2026-05 unverdicted novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Feature Starvation as Geometric Instability in Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

Adaptive elastic net SAEs (AEN-SAEs) mitigate feature starvation in SAEs by combining ℓ2 structural stability with adaptive ℓ1 reweighting, producing a Lipschitz-continuous sparse coding map that recovers global featu...
Adaptive Inverted-Index Routing for Granular Mixtures-of-Experts
cs.LG 2026-05 unverdicted novelty 6.0

AIR-MoE introduces a two-stage inverted-index routing method based on vector quantization that approximates optimal expert selection for granular MoE models at lower cost and with empirical performance gains.
Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
cs.CV 2026-05 unverdicted novelty 6.0

A video transfer pipeline augments simulated VLA data into realistic videos while preserving actions, yielding consistent performance gains on robot benchmarks such as 8% on Robotwin 2.0.
Finite-Size Gradient Transport in Large Language Model Pretraining: From Cascade Size to Intensive Transport Efficiency
cs.LG 2026-05 unverdicted novelty 6.0

A gradient-transport framework with observables D, z, β, δ, v_rel applied to Pico-LM and Pythia datasets shows distinct scaling regimes in duration and efficiency while sharing a near-unity cascade-size backbone.
NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty
cs.AI 2026-05 unverdicted novelty 6.0

NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acq...
Probe-Geometry Alignment: Erasing the Cross-Sequence Memorization Signature Below Chance
cs.LG 2026-05 unverdicted novelty 6.0

Probe-geometry alignment erases cross-sequence memorization signatures in LLMs below chance using per-depth rank-one activation interventions with negligible impact on zero-shot capabilities.
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
cs.CL 2026-05 unverdicted novelty 6.0

Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
A Limit Theory of Foundation Models: A Mathematical Approach to Understanding Emergent Intelligence and Scaling Laws
cs.LG 2026-04 unverdicted novelty 6.0

Emergent intelligence is recast as the existence of the limit of performance E(N,P,K) as N,P,K to infinity, with necessary and sufficient conditions derived via nonlinear Lipschitz operator theory and scaling laws obt...
Architecture Determines Observability of Transformers
cs.LG 2026-04 unverdicted novelty 6.0

Architecture and training determine whether transformers retain a readable internal signal that lets activation monitors catch errors missed by output confidence.
PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning
cs.LG 2026-04 unverdicted novelty 6.0

PrivUn shows privacy unlearning in LLMs produces gradient-driven ripple effects and only shallow forgetting across layers, with new strategies proposed for deeper removal.
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
cs.AI 2026-04 unverdicted novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
Improving Robustness In Sparse Autoencoders via Masked Regularization
cs.LG 2026-04 unverdicted novelty 6.0

Masked regularization in sparse autoencoders disrupts token co-occurrences to reduce feature absorption, enhance probing, and narrow OOD gaps across architectures and sparsity levels.
In-Place Test-Time Training
cs.LG 2026-04 conditional novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Geometric Limits of Knowledge Distillation: A Minimum-Width Theorem via Superposition Theory
cs.LG 2026-04 conditional novelty 6.0

Student networks are limited to d_S * g(α) features via superposition, creating a permanent importance-weighted loss floor in distillation that cannot be overcome by training.
Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
cs.LG 2026-03 conditional novelty 6.0

Nonlinear query projections of the form X + MLP(X) improve transformer performance on small models with only d² + O(d) added parameters.
Titans: Learning to Memorize at Test Time
cs.LG 2024-12 unverdicted novelty 6.0

Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
cs.CL 2024-06 unverdicted novelty 6.0

FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
cs.LG 2024-03 unverdicted novelty 6.0

Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
cs.RO 2024-03 accept novelty 6.0

DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
cs.CV 2023-11 conditional novelty 6.0

Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
Efficient Streaming Language Models with Attention Sinks
cs.CL 2023-09 accept novelty 6.0

StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.
Retentive Network: A Successor to Transformer for Large Language Models
cs.CL 2023-07 unverdicted novelty 6.0

RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
cs.CL 2023-06 unverdicted novelty 6.0

Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
BloombergGPT: A Large Language Model for Finance
cs.LG 2023-03 conditional novelty 6.0

BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

Reference graph

Works this paper leans on

193 extracted references · 193 canonical work pages · cited by 81 Pith papers · 11 internal anchors

[1]

Sony corp

1984. Sony corp. of america v. universal city studios, inc

work page 1984
[2]

2003. Kelly v. arriba soft corp

work page 2003
[3]

Righthaven llc v

2013. Righthaven llc v. hoehn

work page 2013
[5]

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining . In LREC. European Language Resources Association

work page 2010
[6]

Emily M Bender and Batya Friedman. 2018. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587--604

work page 2018
[7]

Stella Biderman. 2021. Data statement for the P ile. arXiv preprint arXiv

work page 2021
[8]

Stella Biderman, Kieran Bicheno, and Leo Gao. 2021. Datasheet for the P ile. arXiv preprint arXiv

work page 2021
[9]

Scheirer

Stella Biderman and Walter J. Scheirer. 2020. Pitfalls in machine learning research: Reexamining the development cycle. NeurIPS ``I Can't Believe It's Not Better!'' Workshop

work page 2020
[10]

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993--1022

work page 2003
[12]

Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Inc

work page 2014
[13]

Nick Bostrom and Eliezer Yudkowsky. 2014. The ethics of artificial intelligence. The Cambridge handbook of artificial intelligence, 1:316--334

work page 2014
[16]

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. http://arxiv.org/abs/2012.07805 Extracting training data from large language models

work page arXiv 2020
[18]

Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. WW Norton & Company

work page 2020
[19]

Alina Maria Ciobanu, Liviu P Dinu, and Andrea Sgarro. 2017. Towards a map of the syntactic similarity of languages. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 576--590. Springer

work page 2017
[20]

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://www.aclweb.org/anthology/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale . In Proceedings of the 58th Annual Meeting of the Association for...

work page 2020
[22]

Andrew Critch and David Krueger. 2020. AI Research Considerations for Human Existential Safety (ARCHES) . Preprint at acritch.com/arches http://acritch.com/arches

work page 2020
[24]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) . Association for Com...

[25]

István Endrédy and Attila Novák. 2013. More effective boilerplate removal – the GoldMiner algorithm. In Polibits

[26]

Niels Ferguson and Bruce Schneier. 2003. Practical Cryptography. John Wiley & Sons

[27]

Casey Fiesler, Nathan Beard, and Brian C Keegan. 2020. No robots, spiders, or scrapers: Legal and ethical regulation of data collection methods in social media terms of service. In Proceedings of the International AAAI Conference on Web and Social Media, volume 14, pages 187--196

[29]

Aaron Gokaslan and Vanya Cohen. 2019. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus

[30]

Authors Guild v. Google. 2015. Docket No. 13-4829-cv, 804:202

[31]

Katja Grace, John Salvatier, Allan Dafoe, Baobao Zhang, and Owain Evans. 2018. When will AI exceed human performance? evidence from AI experts. Journal of Artificial Intelligence Research, 62:729--754

[32]

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia, 4(1):34

[33]

Declan Groves and Andy Way. 2006. Hybridity in MT: Experiments on the Europarl corpus. In Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT 2006)

[34]

Alexander Halavais. 2019. Overcoming terms of service: a proposal for ethical distributed research. Information, Communication & Society, 22(11):1567--1581

[35]

Chris Hardin. 2018. https://blog.janestreet.com/how-to-shuffle-a-big-dataset/ How to shuffle a big dataset

[37]

Matthew Hoffman, Francis Bach, and David Blei. 2010. Online learning for latent Dirichlet allocation. Advances in Neural Information Processing Systems, 23:856--864

[38]

Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591--598

[39]

Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. arXiv preprint arXiv:2005.00813

[40]

Eun Seo Jo and Timnit Gebru. 2020. Lessons from archives: Strategies for collecting sociocultural data in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 306--316

[42]

Bryan Klimt and Yiming Yang. 2004. The Enron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217--226. Springer

[43]

Sosuke Kobayashi. 2018. Homemade bookcorpus. https://github.com/BIGBALLON/cifar-10-cnn

[44]

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit, volume 5, pages 79--86. Citeseer

[47]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692

[48]

Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, pages 62--69. Somerset, NJ: Association for Computational Linguistics. http://arXiv.org/abs/cs/0205028

[49]

John MacFarlane. 2006--2020. https://pandoc.org/ Pandoc: a universal document converter

[50]

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26, pages 3111--3119. Curran Associates, Inc

[51]

Jonathan A Obar. 2020. Sunlight alone is not a disinfectant: Consent and the futility of opening big data black boxes (without assistance). Big Data & Society, 7(1):2053951720935615

[52]

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. http://www.aclweb.org/anthology/D14-1162 GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532--1543

[54]

Shawn Presser. 2020. Books3. https://twitter.com/theshawwn/status/1320282149329784833

[55]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI

[56]

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9

[57]

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. 2019. https://arxiv.org/abs/1911.05507 Compressive transformers for long-range sequence modelling. arXiv preprint

[59]

Inioluwa Deborah Raji and Jingying Yang. 2019. ABOUT ML: Annotation and benchmarking on understanding and transparency of machine learning lifecycles. arXiv preprint arXiv:1912.06166

[60]

C. Radhakrishna Rao. 1961. http://www.jstor.org/stable/25049166 Generation of random permutations of given number of elements using random sampling numbers. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 23(3):305--307

[61]

Radim Rehurek, Petr Sojka, et al. 2011. Gensim—statistical semantics in python. NLP Centre, Faculty of Informatics, Masaryk University

[62]

C Rosset. 2019. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog

[63]

S. Russell. 2019. https://books.google.de/books?id=M1eFDwAAQBAJ Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Publishing Group

[66]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053

[67]

Carl Shulman and Nick Bostrom. 2020. Sharing the world with digital minds. preprint

[68]

Kaj Sotala and Lukas Gloor. 2017. Superintelligence as a cause or cure for risks of astronomical suffering. Informatica, 41(4)

[70]

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016

[71]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33

[72]

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019a. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache

[73]

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019b. Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In 7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7). Leibniz-Institut für Deutsche Sprache

[74]

Anja Thieme, Danielle Belgrave, and Gavin Doherty. 2020. Machine learning in mental health: A systematic review of the HCI literature to support the development of effective and implementable ML systems. ACM Transactions on Computer-Human Interaction (TOCHI), 27(5):1--53

[75]

J. Tiedemann. 2016. Finding alternative translations in a large corpus of movie subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)

[76]

Trieu H. Trinh and Quoc V. Le. 2018. http://arxiv.org/abs/1806.02847 A simple method for commonsense reasoning. CoRR, abs/1806.02847

[77]

Hans Van Halteren. 2008. Source language markers in Europarl translations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 937--944

[78]

Jessica Vitak, Katie Shilton, and Zahra Ashktorab. 2016. Beyond the Belmont principles: Ethical challenges, practices, and beliefs in the online data research community. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, pages 941--953

[80]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. https://www.aclweb.org/a...

[82]

Eliezer Yudkowsky. 2013. Intelligence explosion microeconomics. Machine Intelligence Research Institute, accessed online October, 23:2015

[83]

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. http://papers.nips.cc/paper/9106-defending-against-neural-fake-news.pdf Defending against neural fake news. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Pro...

[84]

Victor Zhou. 2019. Building a better profanity detection library with scikit-learn

[85]

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19--27

[86]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The

[87]

Stella Biderman, Kieran Bicheno, and Leo Gao. Datasheet for the

[88]

Stella Biderman. Data Statement for the

[90]

Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683

[93]

Language models are few-shot learners. arXiv preprint arXiv:2005.14165

[94]

GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668

[98]

Generic Web Content Extraction with Open-Source Software. KONVENS

